Computational Optimization
Last of unconstrained (2/26)
Half-way there
Minimize f(x) (objective function) subject to x ∈ S (constraints).
Problems can be characterized by the type of objective function and constraints.
NEOS Optimization Guide: http://www-fp.mcs.anl.gov/otc/Guide/OptWeb/
Optimization Recipes
Optimization algorithms are like recipes using common ingredients:
- Step-size
- Trust regions
- Newton's method
- Quasi-Newton
- Conjugate directions
- …
Just stir up the right combination
Some other ingredients
- Trust region methods
- Limited memory quasi-Newton
- Linear least squares
- Nonlinear least squares
- Finite difference methods
- Automatic differentiation
Trust Region Methods
- Alternative to line search methods
- Optimize a quadratic model of the objective within the “trust region”
x_{k+1} = x_k + p_k (not necessarily along −∇f(x_k)), where

p_k = argmin_p  f(x_k) + ∇f(x_k)'p + ½ p'B_k p   s.t.  ||p|| ≤ Δ_k
Options
- How to pick B_k: Newton or quasi-Newton
- How to pick the trust region radius Δ_k:
  - Shrink it if you fail to get a decrease
  - Increase it if you get a good decrease
  - Otherwise keep it the same
- The trust region subproblem need not be solved exactly; there are many variations.
Use ratio to determine trust region radius
p_k = argmin_p  m_k(p) = f(x_k) + ∇f(x_k)'p + ½ p'B_k p   s.t.  ||p|| ≤ Δ_k

ρ_k = ( f(x_k) − f(x_k + p_k) ) / ( m_k(0) − m_k(p_k) )

Look at the ratio ρ_k of actual versus predicted decrease. If the ratio is near one (and the step reaches the boundary, ||p_k|| = Δ_k), increase the radius. If the ratio is near zero, decrease the radius.
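As a minimal Matlab sketch of this update rule (the 0.25/0.75 thresholds and helper names are illustrative conventions, not from the slides):

function [x, Delta] = tr_update(f, x, p, pred_dec, Delta, DeltaMax)
% One trust-region acceptance / radius update using the ratio rho_k.
% p: step from the subproblem solver; pred_dec: m_k(0) - m_k(p).
rho = (f(x) - f(x + p)) / pred_dec;     % actual vs. predicted decrease
if rho < 0.25
    Delta = Delta / 4;                  % ratio near zero: shrink radius
elseif rho > 0.75 && norm(p) >= Delta - 1e-12
    Delta = min(2 * Delta, DeltaMax);   % near one at the boundary: grow
end                                     % otherwise keep Delta the same
if rho > 0                              % accept the step only if f decreased
    x = x + p;
end
end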
Trust Region Methods
Pros:
- Picks direction and step size simultaneously
- Global convergence
- Superlinear convergence in many cases
- Some variants are very effective in practice
Cons:
- Must solve one or more constrained trust region subproblems at each iteration
BFGS in NW (Nocedal & Wright)
H_k has a nice low-rank structure, and all we really need is to multiply it by the gradient, so we can compute these products without explicitly storing H_k.
x_{k+1} = x_k − α_k H_k ∇f_k, where

H_{k+1} = V_k' H_k V_k + ρ_k s_k s_k'
with  V_k = I − ρ_k y_k s_k',  ρ_k = 1 / (y_k' s_k),
s_k = x_{k+1} − x_k,  y_k = ∇f_{k+1} − ∇f_k
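Transcribed directly as a Matlab sketch (full-matrix form; storing H explicitly is exactly what the limited-memory version below avoids):

function H = bfgs_update(H, s, y)
% BFGS update of the inverse Hessian approximation:
% H_{k+1} = V_k' * H_k * V_k + rho_k * s_k * s_k'
rho = 1 / (y' * s);                  % y's > 0 under a Wolfe line search
V = eye(length(s)) - rho * (y * s'); % V_k = I - rho_k y_k s_k'
H = V' * H * V + rho * (s * s');
end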
Expanding recursively, the expression for H_k in terms of the stored pairs (s_i, y_i) grows at each iteration:
H_k = (V_{k−1}' ⋯ V_{k−m}') H_{k−m} (V_{k−m} ⋯ V_{k−1})
    + ρ_{k−m} (V_{k−1}' ⋯ V_{k−m+1}') s_{k−m} s_{k−m}' (V_{k−m+1} ⋯ V_{k−1})
    + ρ_{k−m+1} (V_{k−1}' ⋯ V_{k−m+2}') s_{k−m+1} s_{k−m+1}' (V_{k−m+2} ⋯ V_{k−1})
    + ⋯
    + ρ_{k−1} s_{k−1} s_{k−1}'
So we can define a recursive procedure (see Algorithm 9.1, page 225 of Nocedal & Wright) that requires only inner products (assuming H_0 is diagonal). It uses only 4mn multiplications and requires storing only the pairs s_k, y_k.
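A minimal sketch of that recursive procedure (the two-loop recursion), assuming the last m pairs are stored as columns of S and Y, oldest first; variable names are illustrative:

function r = lbfgs_twoloop(g, S, Y, gamma)
% Computes r = H_k * g from stored pairs without ever forming H_k.
m = size(S, 2);
rho = zeros(m, 1); alpha = zeros(m, 1);
for i = 1:m
    rho(i) = 1 / (Y(:,i)' * S(:,i));
end
q = g;
for i = m:-1:1                       % backward pass: newest pair first
    alpha(i) = rho(i) * (S(:,i)' * q);
    q = q - alpha(i) * Y(:,i);
end
r = gamma * q;                       % initial matrix H0 = gamma * I
for i = 1:m                          % forward pass: oldest pair first
    beta = rho(i) * (Y(:,i)' * r);
    r = r + S(:,i) * (alpha(i) - beta);
end
end

The search direction is then -r; only inner products and vector updates appear, consistent with the 4mn multiplication count.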
More improvements
- H_0 can be changed at each iteration. A good choice in practice is

  H_k^0 = γ_k I,   γ_k = (s_{k−1}' y_{k−1}) / (y_{k−1}' y_{k−1})

- Limit memory: only store (s_k, y_k) for the last m iterates and base the approximation on those.
Limited Memory BFGS: pros and cons
- Usually the best algorithm for large problems with non-sparse Hessians
- May not be best if the problem has special structure, e.g. sparsity, separable structure, or nonlinear least squares
- Needs a Wolfe step size
- Relatively cheap iterates
- Robust
- May converge slowly on highly ill-conditioned problems
Partially Separable Structure
Examples
f(x) = f_1(x_1, x_3) + f_2(x_1, x_4) + f_3(x_4, x_5)

f(x) = Σ_{i=1}^m f_i(x)
Predict Drug Bioavailability
Aqueous solubility (Aquasol)
525 descriptors generated:
- Electronic (TAE)
- Traditional
197 molecules with tested solubility:
x_i ∈ R^525,  y ∈ R,  ℓ = 197
1-d Regression with bias
g(x) = ⟨w, x⟩ + b
[Figure: 1-d linear fit of y versus x with intercept b = 2]
Linear Regression
Given training data (points and labels):

S = ( (x_1, y_1), (x_2, y_2), …, (x_ℓ, y_ℓ) ),   x_i ∈ R^n,  y_i ∈ R

Construct a linear function:

g(x) = ⟨w, x⟩ + b = x'w + b = Σ_{i=1}^n w_i x_i + b

Goal: for future data (x, y) with y unknown, g(x) ≈ y
Least Squares Approximation
Want:  g(x) ≈ y

Define the error:  ξ = f(x, y) = y − g(x)

Minimize the loss:  L(g, S) = Σ_{i=1}^ℓ ( y_i − g(x_i) )²
Linear Least Squares Loss
L(w, b) = Σ_{i=1}^ℓ ( ⟨w, x_i⟩ + b − y_i )² = ||Xw + be − y||² = (Xw + be − y)'(Xw + be − y)
Optimal Solution
Want:  y ≈ Xw + be, where e is a vector of ones

Mathematical model:

min_{w,b}  L(w, b, S) = ||y − Xw − be||² + λ||w||²

Optimality conditions:

∂L(w, b, S)/∂w = −2X'(y − Xw − be) + 2λw = 0
∂L(w, b, S)/∂b = −2e'(y − Xw − be) = 0
Optimal Solution
Thus:

(X'X + λI)w = X'y − bX'e
b e'e = e'y − e'Xw  ⇒  b = mean(y) − mean(x)'w

Assume the data are scaled such that mean(x) = X'e/ℓ = 0. Then:

w = (X'X + λI)^{-1} X'y,   b = mean(y)
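As a minimal Matlab sketch of this closed form (function and variable names are illustrative), centering X so that X'e = 0 as assumed above:

function [w, b] = ridge_fit(X, y, lambda)
% Regularized linear least squares via the closed form above.
mu = mean(X, 1);                          % column means of the data
Xc = X - repmat(mu, size(X,1), 1);        % center so that Xc' * e = 0
w = (Xc' * Xc + lambda * eye(size(X,2))) \ (Xc' * y);
b = mean(y) - mu * w;                     % b = mean(y) - mean(x)' w
end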
Nonlinear Least squares
Partially separable problem
f(a, b, c) = ½ Σ_{i=1}^ℓ f_i(a, b, c)²

f_i(a, b, c) = y_i − (a x_i² + b x_i + c)
Nonlinear Least squares
Partially separable problem
f(a, b, c) = ½ Σ_{i=1}^ℓ f_i(a, b, c)²

f_i(a, b, c) = y_i − (a x_i² + b x_i + c)

∇f_i(a, b, c) = −[x_i², x_i, 1]'

∇f(a, b, c) = Σ_{i=1}^ℓ f_i(a, b, c) ∇f_i(a, b, c) = −Σ_{i=1}^ℓ ( y_i − (a x_i² + b x_i + c) ) [x_i², x_i, 1]'
Nonlinear Least Squares
Problems of the type:

min_x f(x) = ½ Σ_i f_i(x)²

The gradient is Σ_i f_i(x) ∇f_i(x).
The Hessian is Σ_i ∇f_i(x) ∇f_i(x)' + Σ_i f_i(x) ∇²f_i(x).
Approximate the Hessian by Σ_i ∇f_i(x) ∇f_i(x)':
- Newton with this approximation = Gauss-Newton
- Gauss-Newton + trust region = Levenberg-Marquardt
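A minimal Gauss-Newton sketch for the quadratic-fit example above (step logic only; adding a trust region to this step gives Levenberg-Marquardt):

function theta = gauss_newton_quadfit(x, y, theta, iters)
% Fit y ~ a*x^2 + b*x + c by Gauss-Newton; theta = [a; b; c].
for k = 1:iters
    r = y - (theta(1)*x.^2 + theta(2)*x + theta(3)); % residuals f_i
    J = -[x.^2, x, ones(size(x))];  % rows are grad f_i'
    g = J' * r;                     % gradient: sum of f_i * grad f_i
    B = J' * J;                     % Gauss-Newton Hessian approximation
    theta = theta - B \ g;          % Newton-like step using B
end
end

Since this model is linear in (a, b, c), one step already solves the problem; the iteration matters for genuinely nonlinear residuals.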
Matlab
Check out Matlab optimization:
- Type bandem
- help fminunc
It has all the basics.
What if gradient not available?
Can use finite difference methods. Recall:

f'(x) = lim_{h→0} ( f(x + h) − f(x) ) / h

Approximate using a small h:

f'(x) ≈ ( f(x + h) − f(x) ) / h

∂f(x)/∂x_1 ≈ ( f(x + h e_1) − f(x) ) / h,   where e_1 = [1, 0, …, 0]'
Problems
- Introduces error
- The best value of h is very small, close to (the square root of) machine precision
- Must be repeated for each dimension

Forward difference:  ( f(x + h) − f(x) ) / h
Central difference:  ( f(x + h) − f(x − h) ) / (2h)
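A minimal forward-difference gradient sketch (names illustrative); note the one extra function evaluation per dimension:

function g = fd_gradient(f, x, h)
% Forward-difference approximation of the gradient of f at x.
if nargin < 3, h = sqrt(eps); end   % h near sqrt(machine precision)
n = length(x);
g = zeros(n, 1);
fx = f(x);
for i = 1:n
    e = zeros(n, 1); e(i) = 1;      % coordinate direction e_i
    g(i) = (f(x + h*e) - fx) / h;   % (f(x + h e_i) - f(x)) / h
end
end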
Automatic Differentiation
function makegradient(fcn, name)
%Creates a new matlab function (defined by gradient(fcn))
%and saves it with the specified name
%
% Example: makegradient('x^2+y^2','gf') creates a file gf.m:
%
%   function functout = gf(v)
%   x = v(1);
%   y = v(2);
%   functout = [2*x, 2*y];
Automatic Differentiation
function makehessian(fcn, name)
%Creates a new matlab function (defined by hessian(fcn))
%and saves it with the specified name
%
% Example: makehessian('x^7+x*y^3','hf') creates a file hf.m:
%
%   function functout = hf(v)
%   x = v(1);
%   y = v(2);
%   functout = [[42*x^5, 3*y^2]; [3*y^2, 6*x*y]];
Fminunc
First try using a finite difference approximation for the gradient. For example, on L:

X0 = [1,1]';
Options = optimset('display','iter');
X = fminunc(@L, X0, Options)
To use real gradient - Option 1
Combine f, g, and H into one Matlab file:

function [f,g,H] = matL(x)
f = L(x);
if nargout > 1
    g = gradL(x);
end
if nargout > 2
    H = hessL(x);
end
Options = optimset('gradobj','on','Display','iter');
X = fminunc(@matL, X0, Options)
To use real gradient - Option 2
Options = optimset('gradobj','on','Display','iter');
X = fminunc({@L,@gradL}, X0, Options)
To use Hessian
Same as for the gradient, but add:

Options = optimset(Options,'hessian','on');
X = fminunc(@matL, X0, Options)

Or:

X = fminunc({@L,@gradL,@hessL}, X0, Options)
Check your gradients!
Try:

Options = optimset(Options,'DerivativeCheck','on')