Computational Optimization
Last of unconstrained (2/26)
Half-way there
Minimize f(x) (objective function) subject to x ∈ S (constraints).
Problems can be characterized by the type of objective function and constraints.
NEOS Optimization Guide: http://www-fp.mcs.anl.gov/otc/Guide/OptWeb/
Optimization Recipes
Optimization algorithms are like recipes using common ingredients:
- Step-size
- Trust regions
- Newton's method
- Quasi-Newton
- Conjugate directions
- …
Just stir up the right combination
Some other ingredients
- Trust region methods
- Limited memory quasi-Newton
- Linear least squares
- Nonlinear least squares
- Finite difference methods
- Automatic differentiation
Trust Region Methods
- Alternative to line search methods
- Optimize a quadratic model of the objective within the “trust region”
x_{k+1} = x_k + p_k (not necessarily along −∇f(x_k)), where

p_k = argmin_p  f(x_k) + ∇f(x_k)'p + ½ p'B_k p   s.t.  ||p|| ≤ Δ_k
Options
- How to pick B_k: Newton or quasi-Newton
- How to pick the trust region radius Δ_k:
  - Shrink it if you fail to get a decrease
  - Increase it if you get a good decrease
  - Otherwise keep it the same
- The trust region subproblem need not be solved exactly; there are many variations.
Use ratio to determine trust region radius
p_k = argmin_p  m_k(p) = f(x_k) + ∇f(x_k)'p + ½ p'B_k p   s.t.  ||p|| ≤ Δ_k

ρ_k = ( f(x_k) − f(x_k + p_k) ) / ( m_k(0) − m_k(p_k) )

Look at the ratio ρ_k of actual versus predicted decrease. If the ratio is near one (and the step reaches the boundary, ||p_k|| = Δ_k), increase the radius. If the ratio is near zero, decrease the radius.
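As a minimal Matlab sketch of this update rule (the 0.25/0.75 thresholds and helper names are illustrative conventions, not from the slides):

function [x, Delta] = tr_update(f, x, p, pred_dec, Delta, DeltaMax)
% One trust-region acceptance / radius update using the ratio rho_k.
% p: step from the subproblem solver; pred_dec: m_k(0) - m_k(p).
rho = (f(x) - f(x + p)) / pred_dec;     % actual vs. predicted decrease
if rho < 0.25
    Delta = Delta / 4;                  % ratio near zero: shrink radius
elseif rho > 0.75 && norm(p) >= Delta - 1e-12
    Delta = min(2 * Delta, DeltaMax);   % near one at the boundary: grow
end                                     % otherwise keep Delta the same
if rho > 0                              % accept the step only if f decreased
    x = x + p;
end
end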
Trust Region Methods
Pros:
- Picks direction and step size simultaneously
- Global convergence
- Superlinear convergence in many cases
- Some variants are very effective in practice
Cons:
- Must solve one or more constrained trust region subproblems at each iteration
BFGS in NW (Nocedal & Wright)
H_k has a nice low-rank structure, and all we really need is to multiply it by the gradient, so we can compute these products without explicitly storing H_k.
x_{k+1} = x_k − α_k H_k ∇f_k, where

H_{k+1} = V_k' H_k V_k + ρ_k s_k s_k'
with  V_k = I − ρ_k y_k s_k',  ρ_k = 1 / (y_k' s_k),
s_k = x_{k+1} − x_k,  y_k = ∇f_{k+1} − ∇f_k
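Transcribed directly as a Matlab sketch (full-matrix form; storing H explicitly is exactly what the limited-memory version below avoids):

function H = bfgs_update(H, s, y)
% BFGS update of the inverse Hessian approximation:
% H_{k+1} = V_k' * H_k * V_k + rho_k * s_k * s_k'
rho = 1 / (y' * s);                  % y's > 0 under a Wolfe line search
V = eye(length(s)) - rho * (y * s'); % V_k = I - rho_k y_k s_k'
H = V' * H * V + rho * (s * s');
end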
Expanding recursively, the expression for H_k in terms of the stored pairs (s_i, y_i) grows at each iteration:
H_k = (V_{k−1}' ⋯ V_{k−m}') H_{k−m} (V_{k−m} ⋯ V_{k−1})
    + ρ_{k−m} (V_{k−1}' ⋯ V_{k−m+1}') s_{k−m} s_{k−m}' (V_{k−m+1} ⋯ V_{k−1})
    + ρ_{k−m+1} (V_{k−1}' ⋯ V_{k−m+2}') s_{k−m+1} s_{k−m+1}' (V_{k−m+2} ⋯ V_{k−1})
    + ⋯
    + ρ_{k−1} s_{k−1} s_{k−1}'
So we can define a recursive procedure (see Algorithm 9.1, page 225 of Nocedal & Wright) that requires only inner products (assuming H_0 is diagonal). It uses only 4mn multiplications and requires storing only the pairs s_k, y_k.
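A minimal sketch of that recursive procedure (the two-loop recursion), assuming the last m pairs are stored as columns of S and Y, oldest first; variable names are illustrative:

function r = lbfgs_twoloop(g, S, Y, gamma)
% Computes r = H_k * g from stored pairs without ever forming H_k.
m = size(S, 2);
rho = zeros(m, 1); alpha = zeros(m, 1);
for i = 1:m
    rho(i) = 1 / (Y(:,i)' * S(:,i));
end
q = g;
for i = m:-1:1                       % backward pass: newest pair first
    alpha(i) = rho(i) * (S(:,i)' * q);
    q = q - alpha(i) * Y(:,i);
end
r = gamma * q;                       % initial matrix H0 = gamma * I
for i = 1:m                          % forward pass: oldest pair first
    beta = rho(i) * (Y(:,i)' * r);
    r = r + S(:,i) * (alpha(i) - beta);
end
end

The search direction is then -r; only inner products and vector updates appear, consistent with the 4mn multiplication count.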
More improvements
- H_0 can be changed at each iteration. A good choice in practice is

  H_k^0 = γ_k I,   γ_k = (s_{k−1}' y_{k−1}) / (y_{k−1}' y_{k−1})

- Limit memory: only store (s_k, y_k) for the last m iterates and base the approximation on those.
Limited Memory BFGS: pros and cons
- Usually the best algorithm for large problems with non-sparse Hessians
- May not be best if the problem has special structure, e.g. sparsity, separable structure, or nonlinear least squares
- Needs a Wolfe step size
- Relatively cheap iterates
- Robust
- May converge slowly on highly ill-conditioned problems
Partially Separable Structure
Examples
f(x) = f_1(x_1, x_3) + f_2(x_1, x_4) + f_3(x_4, x_5)

f(x) = Σ_{i=1}^m f_i(x)
Predict Drug Bioavailability
Aqueous solubility (Aquasol)
525 descriptors generated:
- Electronic (TAE)
- Traditional
197 molecules with tested solubility:
x_i ∈ R^525,  y ∈ R,  ℓ = 197
1-d Regression with bias
g(x) = ⟨w, x⟩ + b
[Figure: 1-d linear fit of y versus x with intercept b = 2]
Linear Regression
Given training data (points and labels):

S = ( (x_1, y_1), (x_2, y_2), …, (x_ℓ, y_ℓ) ),   x_i ∈ R^n,  y_i ∈ R

Construct a linear function:

g(x) = ⟨w, x⟩ + b = x'w + b = Σ_{i=1}^n w_i x_i + b

Goal: for future data (x, y) with y unknown, g(x) ≈ y
Least Squares Approximation
Want:  g(x) ≈ y

Define the error:  ξ = f(x, y) = y − g(x)

Minimize the loss:  L(g, S) = Σ_{i=1}^ℓ ( y_i − g(x_i) )²
Linear Least Squares Loss
L(w, b) = Σ_{i=1}^ℓ ( ⟨w, x_i⟩ + b − y_i )² = ||Xw + be − y||² = (Xw + be − y)'(Xw + be − y)
Optimal Solution
Want:  y ≈ Xw + be, where e is a vector of ones

Mathematical model:

min_{w,b}  L(w, b, S) = ||y − Xw − be||² + λ||w||²

Optimality conditions:

∂L(w, b, S)/∂w = −2X'(y − Xw − be) + 2λw = 0
∂L(w, b, S)/∂b = −2e'(y − Xw − be) = 0
Optimal Solution
Thus:

(X'X + λI)w = X'y − bX'e
b e'e = e'y − e'Xw  ⇒  b = mean(y) − mean(x)'w

Assume the data are scaled such that mean(x) = X'e/ℓ = 0. Then:

w = (X'X + λI)^{-1} X'y,   b = mean(y)
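As a minimal Matlab sketch of this closed form (function and variable names are illustrative), centering X so that X'e = 0 as assumed above:

function [w, b] = ridge_fit(X, y, lambda)
% Regularized linear least squares via the closed form above.
mu = mean(X, 1);                          % column means of the data
Xc = X - repmat(mu, size(X,1), 1);        % center so that Xc' * e = 0
w = (Xc' * Xc + lambda * eye(size(X,2))) \ (Xc' * y);
b = mean(y) - mu * w;                     % b = mean(y) - mean(x)' w
end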
Nonlinear Least squares
Partially separable problem
f(a, b, c) = ½ Σ_{i=1}^ℓ f_i(a, b, c)²

f_i(a, b, c) = y_i − (a x_i² + b x_i + c)
Nonlinear Least squares
Partially separable problem
f(a, b, c) = ½ Σ_{i=1}^ℓ f_i(a, b, c)²

f_i(a, b, c) = y_i − (a x_i² + b x_i + c)

∇f_i(a, b, c) = −[x_i², x_i, 1]'

∇f(a, b, c) = Σ_{i=1}^ℓ f_i(a, b, c) ∇f_i(a, b, c) = −Σ_{i=1}^ℓ ( y_i − (a x_i² + b x_i + c) ) [x_i², x_i, 1]'
Nonlinear Least Squares
Problems of the type:

min_x f(x) = ½ Σ_i f_i(x)²

The gradient is Σ_i f_i(x) ∇f_i(x).
The Hessian is Σ_i ∇f_i(x) ∇f_i(x)' + Σ_i f_i(x) ∇²f_i(x).
Approximate the Hessian by Σ_i ∇f_i(x) ∇f_i(x)':
- Newton with this approximation = Gauss-Newton
- Gauss-Newton + trust region = Levenberg-Marquardt
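A minimal Gauss-Newton sketch for the quadratic-fit example above (step logic only; adding a trust region to this step gives Levenberg-Marquardt):

function theta = gauss_newton_quadfit(x, y, theta, iters)
% Fit y ~ a*x^2 + b*x + c by Gauss-Newton; theta = [a; b; c].
for k = 1:iters
    r = y - (theta(1)*x.^2 + theta(2)*x + theta(3)); % residuals f_i
    J = -[x.^2, x, ones(size(x))];  % rows are grad f_i'
    g = J' * r;                     % gradient: sum of f_i * grad f_i
    B = J' * J;                     % Gauss-Newton Hessian approximation
    theta = theta - B \ g;          % Newton-like step using B
end
end

Since this model is linear in (a, b, c), one step already solves the problem; the iteration matters for genuinely nonlinear residuals.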
Matlab
Check out Matlab optimization:
- Type bandem
- help fminunc
It has all the basics.
What if gradient not available?
Can use finite difference methods. Recall:

f'(x) = lim_{h→0} ( f(x + h) − f(x) ) / h

Approximate using a small h:

f'(x) ≈ ( f(x + h) − f(x) ) / h

∂f(x)/∂x_1 ≈ ( f(x + h e_1) − f(x) ) / h,   where e_1 = [1, 0, …, 0]'
Problems
- Introduces error
- The best value of h is very small, close to (the square root of) machine precision
- Must be repeated for each dimension

Forward difference:  ( f(x + h) − f(x) ) / h
Central difference:  ( f(x + h) − f(x − h) ) / (2h)
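A minimal forward-difference gradient sketch (names illustrative); note the one extra function evaluation per dimension:

function g = fd_gradient(f, x, h)
% Forward-difference approximation of the gradient of f at x.
if nargin < 3, h = sqrt(eps); end   % h near sqrt(machine precision)
n = length(x);
g = zeros(n, 1);
fx = f(x);
for i = 1:n
    e = zeros(n, 1); e(i) = 1;      % coordinate direction e_i
    g(i) = (f(x + h*e) - fx) / h;   % (f(x + h e_i) - f(x)) / h
end
end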
Automatic Differentiation
function makegradient(fcn, name)
%Creates a new matlab function (defined by gradient(fcn))
%and saves it with the specified name
%
% Example: makegradient('x^2+y^2','gf') creates a file gf.m:
%
%   function functout = gf(v)
%   x = v(1);
%   y = v(2);
%   functout = [2*x, 2*y];
Automatic Differentiation
function makehessian(fcn, name)
%Creates a new matlab function (defined by hessian(fcn))
%and saves it with the specified name
%
% Example: makehessian('x^7+x*y^3','hf') creates a file hf.m:
%
%   function functout = hf(v)
%   x = v(1);
%   y = v(2);
%   functout = [[42*x^5, 3*y^2]; [3*y^2, 6*x*y]];
Fminunc
First try using a finite difference approximation for the gradient. For example, on L:

X0 = [1,1]';
Options = optimset('display','iter');
X = fminunc(@L, X0, Options)
To use real gradient - Option 1
Combine f, g, and H into one Matlab file:

function [f,g,H] = matL(x)
f = L(x);
if nargout > 1
    g = gradL(x);
end
if nargout > 2
    H = hessL(x);
end
Options = optimset('gradobj','on','Display','iter');
X = fminunc(@matL, X0, Options)
To use real gradient - Option 2
Options = optimset('gradobj','on','Display','iter');
X = fminunc({@L,@gradL}, X0, Options)
To use Hessian
Same as for the gradient, but add:

Options = optimset(Options,'hessian','on');
X = fminunc(@matL, X0, Options)

Or:

X = fminunc({@L,@gradL,@hessL}, X0, Options)
Check your gradients!
Try:

Options = optimset(Options,'DerivativeCheck','on')