Optimization: Unconstrained and Constrained Optimization
Optimization
- Constrained optimization: Newton with equality constraints, Simplex method, Interior-point method, Active-set method
- Unconstrained optimization
  - One-dimensional: Newton's method
  - Multi-dimensional: Newton's method (basic Newton, Gauss-Newton, quasi-Newton) and descent methods (gradient descent, conjugate gradient)
Unconstrained optimization
- Define an objective function over a domain: f: R^n → R
- Optimization variables: x^T = {x1, x2, ..., xn}

minimize f(x1, x2, ..., xn), i.e., minimize f(x) for x ∈ R^n
Constraints
- Equality constraints: a_i(x) = 0 for x ∈ R^n, where i = 1, ..., p
- Inequality constraints: c_j(x) ≥ 0 for x ∈ R^n, where j = 1, ..., q
Constrained optimization
- Solution: x* satisfies the constraints a_i and c_j while minimizing the objective function f(x)

minimize f(x), for x ∈ R^n
subject to a_i(x) = 0, where i = 1, ..., p
           c_j(x) ≥ 0, where j = 1, ..., q
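As a small illustration of this formulation (not from the slides; the objective, the constraints, and the use of SciPy's general-purpose solver are assumptions made for the example), one equality and one inequality constraint can be encoded as:

    import numpy as np
    from scipy.optimize import minimize

    def f(x):                                    # objective f(x1, x2)
        return (x[0] - 1.0) ** 2 + (x[1] - 2.5) ** 2

    constraints = [
        {"type": "eq",   "fun": lambda x: x[0] + x[1] - 3.0},   # a(x) = 0
        {"type": "ineq", "fun": lambda x: x[1] - 0.5},          # c(x) >= 0
    ]

    result = minimize(f, x0=np.zeros(2), constraints=constraints)
    print(result.x)                              # x* satisfying both constraints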
Formulate an optimization
- A general optimization problem is very difficult to solve
- Certain problem classes can be solved efficiently and reliably
- Convex problems can be solved efficiently and reliably, and the solutions found are global
- Nonconvex problems do not guarantee global solutions
Example: pattern matching
- A pattern can be described by a set of points, P = {p1, p2, ..., pn}
- The same object viewed from a different distance or a different angle corresponds to a different pattern P'
- Two patterns P and P' are similar if, for some scale η, rotation angle θ, and translation (r1, r2),

  p'_i = η [ cos θ  −sin θ ] p_i + [ r1 ]
           [ sin θ   cos θ ]       [ r2 ]
Example: pattern matching
Let Q = {q1, q2, ..., qn} be the target pattern; find the most similar pattern among P1, P2, ..., Pn.
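A minimal numpy sketch (not from the slides; the function name and the least-squares score are illustrative) of how well a transformed P matches the target Q:

    import numpy as np

    def dissimilarity(P, Q, eta, theta, r):
        """Sum of squared distances between the transformed pattern P and the target Q.

        P, Q: (n, 2) point arrays; eta: scale; theta: rotation angle; r: (2,) translation.
        """
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        P_transformed = eta * P @ R.T + r          # p'_i = eta * R p_i + (r1, r2)
        return np.sum((P_transformed - Q) ** 2)    # minimize this over (eta, theta, r)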
Inverse kinematics
Given a set of 3D marker positions, find the pose, described by joint angles, that best matches them.
Optimal motion trajectories
Quiz
- Arrive at d with velocity = 0; maximal force allowed: F
- Minimize time? Minimize energy?
- Unconstrained optimization
- Newton method
- Gauss-Newton method
- Gradient descent method
- Conjugate gradient method
Newton method
Find the roots of a nonlinear function: C(x) = 0

We can linearize the function as
C(x̄) = C(x) + C'(x)(x̄ − x) = 0, where C'(x) = ∂C/∂x

Then we can estimate the root as
x̄ = x − C(x) / C'(x)
Root estimation
(Figure: Newton iterates x(0), x(1), x(2) sliding along the tangents of C(x).)

Each new estimate is the root of the current linearization:
C(x(1)) = C(x(0)) + C'(x(0))(x(1) − x(0)) = 0
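A minimal Python sketch of this root-finding iteration; the function names, tolerance, and example are illustrative, not from the slides:

    def newton_root(C, C_prime, x0, tol=1e-10, max_iter=50):
        """Find a root of C(x) = 0 by repeated linearization: x <- x - C(x)/C'(x)."""
        x = x0
        for _ in range(max_iter):
            step = C(x) / C_prime(x)
            x = x - step
            if abs(step) < tol:
                break
        return x

    # Example: the root of C(x) = x^2 - 2 near x0 = 1 is sqrt(2)
    print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))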
Minimization
Find x* such that the nonlinear function F(x*) is a minimum.
What is the simplest function that has minima? Approximate F locally by a quadratic:

F(x(k) + δ) = F(x(k)) + F'(x(k)) δ + (1/2) F''(x(k)) δ^2

Find the roots of F'(x): setting ∂F(x(k) + δ)/∂δ = 0 gives

δ = − F'(x(k)) / F''(x(k))
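The same idea applied to minimization, as a short sketch (illustrative names; it simply repeats the δ update until the step is small):

    def newton_minimize_1d(F_prime, F_second, x0, tol=1e-10, max_iter=50):
        """Minimize F by driving F'(x) to zero with steps delta = -F'(x)/F''(x)."""
        x = x0
        for _ in range(max_iter):
            delta = -F_prime(x) / F_second(x)
            x = x + delta
            if abs(delta) < tol:
                break
        return x

    # Example: F(x) = (x - 3)^2 has its minimum at x* = 3
    print(newton_minimize_1d(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0))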
Conditions
- What are the conditions for minima to exist?
- Necessary conditions for a local minimum at x*:  F'(x*) = 0 and F''(x*) ≥ 0
- Sufficient conditions for an isolated minimum at x*:  F'(x*) = 0 and F''(x*) > 0
Minimization
(Figure: F(x) and F'(x) near the minimum x*, where F'(x*) = 0 and F''(x*) > 0.)
Multidimensional optimization
- Search methods only need function evaluations
- First-order gradient-based methods also depend on the gradient g
- Second-order gradient-based methods depend on both the gradient g and the Hessian H
Multiple variables
F(x(k) + p) = F(x(k)) + g^T(x(k)) p + (1/2) p^T H(x(k)) p

gradient vector:
g(x) = ∇x F = [ ∂F/∂x1, ..., ∂F/∂xn ]^T

Hessian matrix:
H(x) = ∇²xx F =
[ ∂²F/∂x1²      ...  ∂²F/∂x1∂xn ]
[ ∂²F/∂x2∂x1    ...  ∂²F/∂x2∂xn ]
[     ...       ...      ...     ]
[ ∂²F/∂xn∂x1    ...  ∂²F/∂xn²   ]
Multiple variables
Setting the gradient of this quadratic model to zero gives the Newton step:
0 = g(x(k)) + H(x(k)) p
p = −H(x(k))^{-1} g(x(k))
x(k+1) = x(k) + p
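A minimal numpy sketch of one such Newton step (the gradient and Hessian callables are assumptions for illustration); solving H p = −g is preferable to forming H^{-1} explicitly:

    import numpy as np

    def newton_step(x, grad, hess):
        """One Newton update: solve H(x) p = -g(x) and return x + p."""
        p = np.linalg.solve(hess(x), -grad(x))
        return x + p

    # Example: for a quadratic F(x) = x^T A x / 2 - b^T x, one step reaches the minimum
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])
    print(newton_step(np.zeros(2), lambda x: A @ x - b, lambda x: A))  # equals solve(A, b)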
Multiple variables
Necessary conditions:  g(x*) = 0 and p^T H* p ≥ 0 for all p (H is positive semi-definite)
Sufficient conditions: g(x*) = 0 and p^T H* p > 0 for all p ≠ 0 (H is positive definite)
Gauss-Newton method
- What if the objective function is in the form of a
vector of functions?
- The real-valued objective function can be formed as

  f = [f1(x) f2(x) ... fm(x)]^T,   F = Σ_{p=1}^{m} f_p(x)^2 = f^T f
Jacobian
- Each fp(x) depends on xi for i = 1, 2, ..., n, so a gradient matrix (the Jacobian J, with entries J_pi = ∂f_p/∂x_i) can be formed
- The Jacobian need not be a square matrix
Gradient and Hessian
- Gradient of the objective function:

  ∂F/∂x_i = Σ_{p=1}^{m} 2 f_p(x) ∂f_p/∂x_i,   so g_F = 2 J^T f

- Hessian of the objective function:

  ∂²F/∂x_i∂x_j = 2 Σ_{p=1}^{m} ∂f_p/∂x_i ∂f_p/∂x_j + 2 Σ_{p=1}^{m} f_p(x) ∂²f_p/∂x_i∂x_j,   so H_F ≈ 2 J^T J
Gauss-Newton algorithm
- In the kth iteration, compute f_p(x_k) and J_k to obtain new g_k and H_k
- Compute p_k = −(2 J^T J)^{-1} (2 J^T f) = −(J^T J)^{-1} (J^T f)
- Find α_k that minimizes F(x_k + α_k p_k)
- Set x_{k+1} = x_k + α_k p_k
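A compact numpy sketch of this Gauss-Newton loop; the residual and Jacobian callables are assumptions, and a fixed step α_k = 1 is used in place of the line search in step 3:

    import numpy as np

    def gauss_newton(residuals, jacobian, x0, max_iter=50, tol=1e-10):
        """Minimize F(x) = sum_p f_p(x)^2 with steps p_k = -(J^T J)^{-1} J^T f."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            f = residuals(x)                        # m-vector of f_p(x)
            J = jacobian(x)                         # m x n Jacobian, J[p, i] = df_p/dx_i
            p = np.linalg.solve(J.T @ J, -J.T @ f)
            x = x + p                               # alpha_k = 1 assumed here
            if np.linalg.norm(p) < tol:
                break
        return x

    # Example: fit y = c0 * exp(c1 * t) to noise-free data
    t = np.linspace(0.0, 1.0, 20)
    y = 2.0 * np.exp(-1.5 * t)
    res = lambda c: c[0] * np.exp(c[1] * t) - y
    jac = lambda c: np.column_stack([np.exp(c[1] * t), c[0] * t * np.exp(c[1] * t)])
    print(gauss_newton(res, jac, x0=[1.5, -1.0]))   # approaches [2.0, -1.5]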
- First-order gradient methods
- Greatest gradient descent
- Conjugate gradient
Solving large linear system
Ax = b
- A: a known, square, symmetric, positive semi-definite matrix
- b: a known vector
- x: an unknown vector

If A is dense, solve with factorization and back substitution.
If A is sparse, solve with iterative methods (descent methods).
Quadratic form
F(x) = (1/2) x^T A x − b^T x + c

The gradient of F(x) is
F'(x) = (1/2) A^T x + (1/2) A x − b

If A is symmetric, F'(x) = Ax − b, so the critical point of F, where F'(x) = 0 = Ax − b, is also the solution to Ax = b.
If A is not symmetric, what is the linear system solved by finding the critical points of F?
Greatest gradient descent
Take a step along the direction in which F descends most quickly:
−F'(x(k)) = b − A x(k)

Start at an arbitrary point x(0) and slide down to the bottom of the paraboloid.
Take a series of steps x(1), x(2), ... until we are satisfied that we are close enough to the solution x*.
Greatest gradient descent
Important definitions:
- error: e(k) = x(k) − x*
- residual: r(k) = b − A x(k) = −F'(x(k)) = −A e(k)

Think of the residual as the direction of greatest descent.
Line search
x(1) = x(0) + α r(0)

But how big a step α should we take? A line search is a procedure that chooses α to minimize F along a line.
Optimal step size
Choose α so that
d/dα F(x(1)) = F'(x(1))^T d/dα x(1) = F'(x(1))^T r(0) = 0

That is, the new gradient F'(x(1)) = −r(1) must be orthogonal to the search direction r(0):
r^T_(0) r(1) = 0
Optimal step size
Exercise: derive α from r^T_(k) r(k+1) = 0.
Hint: replace the terms involving (k+1) with those involving (k) using x(k+1) = x(k) + α r(k).

Ans:  α = r^T_(k) r(k) / (r^T_(k) A r(k))
Recurrence of residual
The algorithm requires two matrix-vector multiplications per iteration:
1. r(k) = b − A x(k)
2. α = r^T_(k) r(k) / (r^T_(k) A r(k))
3. x(k+1) = x(k) + α r(k)

One multiplication can be eliminated by replacing step 1 with
r(k+1) = r(k) − α A r(k)
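A numpy sketch of greatest (steepest) gradient descent for a symmetric positive-definite A, using the residual recurrence so only one matrix-vector product is needed per iteration; the names and stopping rule are illustrative:

    import numpy as np

    def steepest_descent(A, b, x0, tol=1e-10, max_iter=1000):
        """Solve Ax = b by stepping along the residual with the optimal step size."""
        x = np.asarray(x0, dtype=float)
        r = b - A @ x                        # r(0) = b - A x(0) = -F'(x(0))
        for _ in range(max_iter):
            if np.linalg.norm(r) < tol:
                break
            Ar = A @ r                       # the single matrix-vector product per iteration
            alpha = (r @ r) / (r @ Ar)       # optimal step size along r
            x = x + alpha * r
            r = r - alpha * Ar               # recurrence replaces r = b - A x
        return x

    A = np.array([[3.0, 2.0], [2.0, 6.0]])
    b = np.array([2.0, -8.0])
    print(steepest_descent(A, b, np.zeros(2)))   # matches np.linalg.solve(A, b)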
Quiz
- In our IK problem, we use the greatest gradient descent method to find an optimal pose, but we cannot compute α using the formula described in the previous slides. Why?
Line search
- Exact line search: choose t to minimize f along the ray {x + tΔx | t ≥ 0}:
  t = argmin_{s ≥ 0} f(x + sΔx)
- Backtracking line search depends on two constants, α and β:
  given a descent direction Δx for f at x ∈ dom f, α ∈ (0, 0.5), β ∈ (0, 1)
  t := 1
  while f(x + tΔx) > f(x) + α t ∇f(x)^T Δx,  t := βt
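A direct Python sketch of the backtracking rule above; the test function, gradient, and constants are illustrative:

    import numpy as np

    def backtracking_line_search(f, grad_f, x, dx, alpha=0.3, beta=0.8):
        """Shrink t until f(x + t dx) <= f(x) + alpha t grad_f(x)^T dx holds."""
        t = 1.0
        slope = grad_f(x) @ dx                    # directional derivative; negative for a descent direction
        while f(x + t * dx) > f(x) + alpha * t * slope:
            t *= beta
        return t

    # Example: a quadratic f with the gradient descent direction dx = -grad_f(x)
    f = lambda x: 0.5 * x @ x
    grad_f = lambda x: x
    x = np.array([4.0, -2.0])
    t = backtracking_line_search(f, grad_f, x, -grad_f(x))
    print(t, x - t * grad_f(x))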
Poor convergence
What is the problem with greatest descent? Wouldn't it be nice if we could avoid traversing the same direction over and over?
Conjugate directions
Pick a set of directions: d(0), d(1), ..., d(n−1)
Take exactly one step along each direction; the solution is found within n steps.
Two problems:
- 1. How do we determine these directions?
- 2. How do we determine the step size along each direction?
A-orthogonality
If we take the optimal step size along each direction,

d/dα F(x(k+1)) = F'(x(k+1))^T d/dα x(k+1) = −r^T_(k+1) d(k) = d^T_(k) A e(k+1) = 0
Two different vectors v and u are A-orthogonal, or conjugate, if v^T A u = 0.
A-orthogonality
(Figure: a pair of A-orthogonal vectors compared with a pair of orthogonal vectors.)
Optimal size step
e(k+1) must be A-orthogonal to d(k).
Using this condition, can you derive α(k)?
Algorithm
Suppose we can come up with a set of A-orthogonal directions {d(k)}; this algorithm will converge in n steps:
1. Compute the step size α(k) = d^T_(k) r(k) / (d^T_(k) A d(k))
2. Take x(k+1) = x(k) + α(k) d(k) along d(k)
3. Move on to the next direction

We need to prove that x* can be found in n steps if we take step size α(k) along d(k) at each step.
Why does it work?
Expand the initial error in the basis of search directions (the d's are linearly independent if the d's are A-orthogonal):

e(0) = Σ_{i=0}^{n−1} δ_i d(i)

d^T_(j) A e(0) = Σ_{i=0}^{n−1} δ_i d^T_(j) A d(i) = δ_j d^T_(j) A d(j)

δ_j = d^T_(j) A e(0) / (d^T_(j) A d(j))
    = d^T_(j) A (e(0) − Σ_{k=0}^{j−1} δ_k d(k)) / (d^T_(j) A d(j))    (the added terms vanish by A-orthogonality)
    = d^T_(j) A e(j) / (d^T_(j) A d(j))
    = −α(j)

So each optimal step along d(j) eliminates exactly the δ_j component of the error; after n steps, e(n) = 0 and x(n) = x*.
Quiz
- Given that d’s are A-orthogonal, prove that d’s
are linearly independent.
Search directions
- We know how to determine the optimal step size
along each direction (second problem solved)
- We still need to figure out what the search directions are
- What do we know about d(0), d(1), ..., d(n−1)?
  - They are A-orthogonal to each other: d^T_(i) A d(j) = 0 for i ≠ j
  - d(i) is A-orthogonal to e(i+1)
Gram-Schmidt Conjugation
Suppose we have a set of linearly independent vectors u's; the search directions can be represented as

d(0) = u_0   and   d(k) = u_k + Σ_{i=0}^{k−1} β_ki d(i)

Use the same trick to get rid of the summation: for k > j,

d^T_(k) A d(j) = u^T_(k) A d(j) + β_kj d^T_(j) A d(j) = 0

β_kj = − u^T_k A d(j) / (d^T_(j) A d(j))

What are the drawbacks of Gram-Schmidt conjugation?
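A small numpy sketch of Gram-Schmidt conjugation (illustrative names); it also makes the drawback visible: every new direction needs all previous directions to be stored and reused:

    import numpy as np

    def gram_schmidt_conjugate(U, A):
        """Turn linearly independent columns of U into mutually A-orthogonal directions."""
        n = U.shape[1]
        D = np.zeros_like(U, dtype=float)
        for k in range(n):
            d = U[:, k].astype(float)
            for j in range(k):                  # loops over every previous direction
                beta_kj = -(U[:, k] @ (A @ D[:, j])) / (D[:, j] @ (A @ D[:, j]))
                d = d + beta_kj * D[:, j]
            D[:, k] = d
        return D

    A = np.array([[3.0, 2.0], [2.0, 6.0]])
    D = gram_schmidt_conjugate(np.eye(2), A)
    print(D[:, 0] @ (A @ D[:, 1]))              # ~0: the two directions are A-orthogonal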
Conjugate gradients
- If we pick the set of u's intelligently, we might be able to save both time and space
- It turns out that the residuals (r's) are an excellent choice for the u's:
  - the residuals are orthogonal to each other
  - each residual is orthogonal to the previous search directions
Conjugate gradients
d(k) = r(k) + Σ_{i=0}^{k−1} β_ki d(i)

For j < k,

d^T_(k) A d(j) = r^T_(k) A d(j) + Σ_{i=0}^{k−1} β_ki d^T_(i) A d(j)

0 = r^T_(k) A d(j) + β_kj d^T_(j) A d(j)    (by A-orthogonality of the d vectors)

β_kj = − r^T_(k) A d(j) / (d^T_(j) A d(j))

Each d(k) requires O(n^3) operations! However...
Conjugate gradients
r(k) is A-orthogonal to all the previous search directions except for d(k−1):

β_kj = − r^T_(k) A d(j) / (d^T_(j) A d(j)) = 0        if j < k − 1

β_kj = r^T_(k) r(k) / (r^T_(k−1) r(k−1))              if j = k − 1

Proof (next slides): r^T_(k) A d(j) = 0 when j < k − 1.
Proof: Orthogonality
Proof: r(k) is orthogonal to all the previous search directions d(0), d(1), ..., d(k−1).

e(k) = Σ_{j=k}^{n−1} δ_j d(j)

d^T_(i) A e(k) = Σ_{j=k}^{n−1} δ_j d^T_(i) A d(j) = 0   if i < k

so d^T_(i) r(k) = 0   if i < k

From here, we can prove:
identity 1:  r^T_(i) r(j) = 0 for i ≠ j
identity 2:  d^T_(k) r(k) = r^T_(k) r(k)
Proof: A-orthogonality
Proof: r(k) is A-orthogonal to all the previous search directions except for d(k−1).

r(j+1) = −A e(j+1) = −A(e(j) + α(j) d(j)) = r(j) − α(j) A d(j)

r^T_(k) r(j+1) = r^T_(k) r(j) − α(j) r^T_(k) A d(j)

Use identity 1 (the residuals are mutually orthogonal):

r^T_(k) A d(j) =  −r^T_(k) r(k) / α(k−1)   if j = k − 1
                   r^T_(k) r(k) / α(k)      if j = k
                   0                        otherwise
Conjugate gradients
Simplify β_k:

β_k = − r^T_(k) A d(k−1) / (d^T_(k−1) A d(k−1))
    = (r^T_(k) r(k) / α(k−1)) / (d^T_(k−1) A d(k−1))     (use identity 1, via the previous slide)
    = r^T_(k) r(k) / (d^T_(k−1) (r(k−1) − r(k)))
    = r^T_(k) r(k) / (d^T_(k−1) r(k−1))                  (use orthogonality: d^T_(k−1) r(k) = 0)
    = r^T_(k) r(k) / (r^T_(k−1) r(k−1))                  (use identity 2)
Conjugate gradients
Put it all together:

d(0) = r(0) = b − A x(0)

α(k) = r^T_(k) r(k) / (d^T_(k) A d(k))
x(k+1) = x(k) + α(k) d(k)
r(k+1) = r(k) − α(k) A d(k)
β(k+1) = r^T_(k+1) r(k+1) / (r^T_(k) r(k))
d(k+1) = r(k+1) + β(k+1) d(k)
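A direct numpy sketch of the assembled conjugate gradient method; the variable names and stopping test are illustrative:

    import numpy as np

    def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
        """Solve Ax = b for symmetric positive-definite A with the conjugate gradient method."""
        x = np.zeros_like(b, dtype=float) if x0 is None else np.asarray(x0, dtype=float)
        r = b - A @ x                      # d(0) = r(0) = b - A x(0)
        d = r.copy()
        rr = r @ r
        for _ in range(len(b) if max_iter is None else max_iter):
            if np.sqrt(rr) < tol:
                break
            Ad = A @ d
            alpha = rr / (d @ Ad)          # alpha(k) = r^T r / (d^T A d)
            x = x + alpha * d
            r = r - alpha * Ad             # r(k+1) = r(k) - alpha(k) A d(k)
            rr_new = r @ r
            beta = rr_new / rr             # beta(k+1) = r(k+1)^T r(k+1) / (r(k)^T r(k))
            d = r + beta * d               # d(k+1) = r(k+1) + beta(k+1) d(k)
            rr = rr_new
        return x

    A = np.array([[3.0, 2.0], [2.0, 6.0]])
    b = np.array([2.0, -8.0])
    print(conjugate_gradient(A, b))        # converges in at most n steps; matches np.linalg.solve(A, b)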
References
- J. Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain
- A. Antoniou and W.-S. Lu, Practical Optimization
- R. Fletcher, Practical Methods of Optimization
- J. Betts, Practical Methods for Optimal Control Using Nonlinear Programming