SLIDE 1

Computational Optimization

Last of unconstrained 2/26

SLIDE 2

Half-way there

Minimize f(x) (the objective function) subject to x ∈ S (the constraints).

Can characterize a problem by the type of its objective function and constraints.

NEOS Optimization Guide: http://www-fp.mcs.anl.gov/otc/Guide/OptWeb/

SLIDE 3

Optimization Recipes

Optimization algorithms are like recipes using common ingredients:

  • Step-size
  • Trust regions
  • Newton’s method
  • Quasi-Newton
  • Conjugate directions
  • ...

Just stir up the right combination

SLIDE 4

Some other ingredients

  • Trust Region Methods
  • Limited Memory Quasi-Newton
  • Linear Least Squares
  • Nonlinear Least Squares
  • Finite difference methods
  • Automatic Differentiation

SLIDE 5

Trust Region Methods

Alternative to line search methods: optimize a quadratic model of the objective within the “trust region”.

[Figure: a step from x_k to x_{k+1}, with the steepest-descent direction -∇f(x_k) shown]

$$ p_k = \arg\min_p \; f(x_k) + \nabla f(x_k)' p + \tfrac{1}{2}\, p' B_k p \qquad \text{s.t.}\ \|p\| \le \Delta_k $$

SLIDE 6

Options

How to pick B_k -- Newton or Quasi-Newton.

How to pick the trust region radius Δ_k:

  • Shrink it if you fail to get a decrease
  • Increase it if you get a good decrease
  • Otherwise keep it the same

The trust region problem need not be solved exactly. Many variations.

SLIDE 7

Use ratio to determine trust region radius

$$ p_k = \arg\min_p \; m_k(p) := f(x_k) + \nabla f(x_k)' p + \tfrac{1}{2}\, p' B_k p \qquad \text{s.t.}\ \|p\| \le \Delta_k $$

$$ \rho_k = \frac{f(x_k) - f(x_k + p_k)}{m_k(0) - m_k(p_k)} $$

Look at the ratio of actual versus predicted decrease. If the ratio is near one and the step reaches the boundary (||p_k|| = Δ_k), then increase the radius. If the ratio is near zero, then decrease the radius.
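A minimal Matlab sketch of this update rule (solve_tr_subproblem, DeltaMax, and the 1/4, 3/4 thresholds are illustrative assumptions, not prescribed by the slides):

% One trust-region iteration: compute the ratio rho of actual to
% predicted decrease and adjust the radius Delta accordingly.
p = solve_tr_subproblem(g, B, Delta);    % approximately minimize the quadratic model
pred = -(g'*p + 0.5*p'*B*p);             % predicted decrease m(0) - m(p)
actual = f(x) - f(x + p);                % actual decrease (f is a function handle)
rho = actual / pred;
if rho < 0.25
    Delta = 0.25 * Delta;                % poor agreement: shrink the radius
elseif rho > 0.75 && norm(p) >= Delta - 1e-12
    Delta = min(2*Delta, DeltaMax);      % good step that hit the boundary: grow it
end                                      % otherwise keep Delta the same
if rho > 0
    x = x + p;                           % accept the step only if f decreased
end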

SLIDE 8

Trust Region Methods

Pros:

  • Pick direction and stepsize simultaneously
  • Global convergence
  • Superlinear convergence in many cases
  • Some types very effective in practice

Cons:

  • Must solve one or more constrained trust region problems at each iteration

SLIDE 9

BFGS in NW

H_k has a nice low-rank structure, and all we really need to do is multiply it by the gradient, so we can compute the step without explicitly storing H_k.

$$ x_{k+1} = x_k - \alpha_k H_k \nabla f_k, \qquad \text{where}\ H_{k+1} = V_k' H_k V_k + \rho_k s_k s_k' $$

$$ \text{with}\quad V_k = I - \rho_k y_k s_k', \quad \rho_k = \frac{1}{y_k' s_k}, \quad s_k = x_{k+1} - x_k, \quad y_k = \nabla f_{k+1} - \nabla f_k $$
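A minimal Matlab sketch of this inverse-Hessian update, written with a dense H purely for illustration (s, y, and H are assumed to come from the current iteration):

rho = 1 / (y' * s);                  % rho_k = 1 / (y_k' s_k)
V = eye(numel(s)) - rho * (y * s');  % V_k = I - rho_k y_k s_k'
H = V' * H * V + rho * (s * s');     % H_{k+1} = V_k' H_k V_k + rho_k s_k s_k'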

SLIDE 10

Hk grows at each iteration

$$ H_k = \big(V_{k-1}' \cdots V_{k-m}'\big)\, H_k^0\, \big(V_{k-m} \cdots V_{k-1}\big) $$
$$ \qquad + \rho_{k-m}\, \big(V_{k-1}' \cdots V_{k-m+1}'\big)\, s_{k-m} s_{k-m}'\, \big(V_{k-m+1} \cdots V_{k-1}\big) $$
$$ \qquad + \rho_{k-m+1}\, \big(V_{k-1}' \cdots V_{k-m+2}'\big)\, s_{k-m+1} s_{k-m+1}'\, \big(V_{k-m+2} \cdots V_{k-1}\big) $$
$$ \qquad + \cdots + \rho_{k-1}\, s_{k-1} s_{k-1}' $$

So we can define a recursive procedure (see Algorithm 9.1, page 225) that only requires inner products (assuming H_0 is diagonal). Uses only 4mn multiplications. Requires only storage of s_k, y_k.
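A minimal Matlab sketch of that recursive procedure, the two-loop recursion (hypothetical function; S and Y hold the last m pairs s_j, y_j as columns, oldest first, and H_k^0 = γ_k I as on the next slide):

function r = lbfgs_two_loop(g, S, Y)
% Computes r = H_k * g using only inner products and vector updates.
% g : current gradient; S(:,j), Y(:,j) : stored pairs s_j, y_j.
m = size(S, 2);
rho = zeros(m, 1);
alpha = zeros(m, 1);
q = g;
for j = m:-1:1                            % backward pass: newest to oldest
    rho(j) = 1 / (Y(:,j)' * S(:,j));
    alpha(j) = rho(j) * (S(:,j)' * q);
    q = q - alpha(j) * Y(:,j);
end
gamma = (S(:,m)' * Y(:,m)) / (Y(:,m)' * Y(:,m));
r = gamma * q;                            % apply H_k^0 = gamma * I
for j = 1:m                               % forward pass: oldest to newest
    beta = rho(j) * (Y(:,j)' * r);
    r = r + (alpha(j) - beta) * S(:,j);
end

The quasi-Newton step is then x_{k+1} = x_k - α_k r.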

SLIDE 11

More improvements

H_0 can be changed at each iteration. A good choice in practice is

$$ H_k^0 = \gamma_k I, \qquad \gamma_k = \frac{s_{k-1}' y_{k-1}}{y_{k-1}' y_{k-1}} $$

Limit memory: only store s_k, y_k for the last m iterates and base the approximation on that.

SLIDE 12

Limited Memory BFGS: pros and cons

  • Usually the best algorithm for large problems with non-sparse Hessians
  • May not be best if the problem has special structure, e.g. sparsity, separable structure, nonlinear least squares
  • Needs a Wolfe stepsize
  • Relatively cheap iterates
  • Robust
  • May converge slowly on highly ill-conditioned problems

SLIDE 13

Partially Separable Structure

Examples

$$ f(x) = f_1(x_1, x_3) + f_2(x_1, x_4) + f_3(x_4, x_5) $$

$$ f(x) = \sum_{i=1}^{m} f_i(x) $$

SLIDE 14

Predict Drug Bioavailability

Aqueous solubility = Aquasol. 525 descriptors generated:

  • Electronic
  • TAE
  • Traditional

197 molecules with tested solubility:

$$ x_i \in R^{525}, \qquad y_i \in R, \qquad \ell = 197 $$

SLIDE 15

1-d Regression with bias

[Figure: data points (x, y) with the fitted line ⟨w, x⟩ + b; intercept b = 2]

SLIDE 16

Linear Regression

Given training data (points and labels):

$$ S = \big( (x_1, y_1), (x_2, y_2), \ldots, (x_\ell, y_\ell) \big), \qquad x_i \in R^n, \ y_i \in R $$

Construct a linear function:

$$ g(x) = \langle w, x \rangle + b = w'x + b = \sum_{i=1}^{n} w_i x_i + b $$

Goal: for future data (x, y) with y unknown,

$$ g(x) \approx y $$

SLIDE 17

Least Squares Approximation

Want:

$$ g(x) \approx y $$

Define the error:

$$ \xi = f(x, y) = y - g(x) $$

Minimize the loss:

$$ L(g, S) = \sum_{i=1}^{\ell} \big( y_i - g(x_i) \big)^2 $$

SLIDE 18

Linear Least Squares Loss

$$ L(w, b) = \sum_{i=1}^{\ell} \big( \langle w, x_i \rangle + b - y_i \big)^2 = \| Xw + be - y \|_2^2 = (Xw + be - y)'(Xw + be - y) $$

SLIDE 19

Optimal Solution

Want (e is a vector of ones):

$$ y \approx Xw + be $$

Mathematical model:

$$ \min_{w,b} \; L(w, b, S) = \| y - (Xw + be) \|^2 + \lambda \| w \|^2 $$

Optimality conditions:

$$ \frac{\partial L(w, b, S)}{\partial w} = -2 X'(y - Xw - be) + 2\lambda w = 0 $$

$$ \frac{\partial L(w, b, S)}{\partial b} = -2 e'(y - Xw - be) = 0 $$

SLIDE 20

Optimal Solution

Thus (assume the data is scaled such that mean(x) = X'e = 0):

$$ (X'X + \lambda I)\, w = X'y - b\, X'e \;\Rightarrow\; w = (X'X + \lambda I)^{-1} X'y $$

$$ e'e\, b = e'y - e'Xw \;\Rightarrow\; b = \mathrm{mean}(y) - \mathrm{mean}(x)'w = \mathrm{mean}(y) $$

SLIDE 21

Nonlinear Least squares

Partially separable problem


$$ f(a, b, c) = \frac{1}{2} \sum_{i=1}^{\ell} f_i(a, b, c)^2 $$

$$ f_i(a, b, c) = y_i - (a x_i^2 + b x_i + c) $$

SLIDE 22

Nonlinear Least squares

Partially separable problem

$$ f(a, b, c) = \frac{1}{2} \sum_{i=1}^{\ell} f_i(a, b, c)^2, \qquad f_i(a, b, c) = y_i - (a x_i^2 + b x_i + c) $$

$$ \nabla f_i(a, b, c) = -\begin{bmatrix} x_i^2 \\ x_i \\ 1 \end{bmatrix} $$

$$ \nabla f(a, b, c) = \sum_{i=1}^{\ell} f_i(a, b, c)\, \nabla f_i(a, b, c) = -\sum_{i=1}^{\ell} \big( y_i - (a x_i^2 + b x_i + c) \big) \begin{bmatrix} x_i^2 \\ x_i \\ 1 \end{bmatrix} $$
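A minimal Matlab sketch of this objective and gradient (hypothetical names; xdat and ydat are column vectors of data, v = [a; b; c]):

function [f, g] = quadfit_obj(v, xdat, ydat)
% Partially separable nonlinear least squares for y ≈ a*x^2 + b*x + c.
a = v(1); b = v(2); c = v(3);
r = ydat - (a*xdat.^2 + b*xdat + c);      % residuals f_i(a,b,c)
f = 0.5 * sum(r.^2);                      % f = (1/2) * sum of f_i^2
J = -[xdat.^2, xdat, ones(size(xdat))];   % i-th row is grad f_i(a,b,c)'
g = J' * r;                               % gradient = sum of f_i * grad f_i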

SLIDE 23

Nonlinear Least Squares

Problems of this type: Newton with the approximate Hessian below = Gauss-Newton, and adding a trust region = Levenberg-Marquardt.

$$ \min_x \; f(x) = \frac{1}{2} \sum_i f_i(x)^2 $$

$$ \text{Gradient:}\quad \nabla f(x) = \sum_i f_i(x)\, \nabla f_i(x) $$

$$ \text{Hessian:}\quad \nabla^2 f(x) = \sum_i \nabla f_i(x)\, \nabla f_i(x)' + \sum_i f_i(x)\, \nabla^2 f_i(x) $$

$$ \text{Approximate Hessian by}\quad \sum_i \nabla f_i(x)\, \nabla f_i(x)' $$
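A minimal Matlab sketch of the resulting iteration (residuals is a hypothetical function returning the vector of f_i(x) and the Jacobian J whose i-th row is ∇f_i(x)'; x holds an initial guess):

mu = 0;                                   % mu = 0: Gauss-Newton; mu > 0: LM-style damping
for iter = 1:50
    [r, J] = residuals(x);
    g = J' * r;                           % gradient of f(x) = (1/2)*sum(r.^2)
    if norm(g) < 1e-8, break; end
    p = -(J'*J + mu*eye(numel(x))) \ g;   % step with the approximate Hessian J'J
    x = x + p;
end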

SLIDE 24

Matlab

Check out Matlab optimization. Type bandem. Help fminunc. Has all the basics.

SLIDE 25

What if gradient not available?

Can use finite difference methods. Recall

$$ f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} $$

Approximate using small h:

$$ f'(x) \approx \frac{f(x + h) - f(x)}{h} $$

$$ \frac{\partial f(x)}{\partial x_1} \approx \frac{f(x + h e_1) - f(x)}{h} \qquad \text{where}\ e_1 = [1, 0, \ldots, 0]' $$
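A minimal Matlab sketch of a forward-difference gradient (hypothetical helper; fcn is a function handle and x a column vector):

function g = fd_gradient(fcn, x, h)
% Forward-difference gradient: one extra function evaluation per dimension.
if nargin < 3, h = sqrt(eps); end   % a typical step size for forward differences
n = numel(x);
g = zeros(n, 1);
f0 = fcn(x);
for i = 1:n
    e = zeros(n, 1); e(i) = 1;      % i-th coordinate direction
    g(i) = (fcn(x + h*e) - f0) / h;
end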

SLIDE 26

Problems

Introduces error. The best value of h is very small (for forward differences, on the order of the square root of machine precision). Have to do an extra evaluation for each dimension.

$$ \text{central:}\quad \frac{f(x + h) - f(x - h)}{2h} \qquad\qquad \text{forward:}\quad \frac{f(x + h) - f(x)}{h} $$

SLIDE 27

Automatic Differentiation

function makegradient(fcn, name)
%Creates a new matlab function (defined by gradient(fcn))
%and saves it with the specified name
% Example: makegradient('x^2+y^2','gf') creates a file gf.m:
%   function functout = gf(v)
%   x = v(1);
%   y = v(2);
%   functout = [2*x, 2*y];

SLIDE 28

Automatic Differentiation

function makehessian(fcn, name)
%Creates a new matlab function (defined by the hessian of fcn)
%and saves it with the specified name
% Example: makehessian('x^7+x*y^3','hf') creates a file hf.m:
%   function functout = hf(v)
%   x = v(1);
%   y = v(2);
%   functout = [[42*x^5, 3*y^2]; [3*y^2, 6*x*y]];

SLIDE 29

Fminunc

First try using the finite difference approximation for the gradient. For example, on L:

X0 = [1,1]';
Options = optimset('display','iter');
X = fminunc(@L,X0,Options)

SLIDE 30

To use real gradient - Option 1

Combine f, g, and H into one matlab file

function [f,g,H] = matL(x)
f = L(x);
if nargout > 1
    g = gradL(x);
end;
if nargout > 2
    H = hessL(x);
end;

Options = optimset('gradobj','on','Display','iter');

X=fminunc(@matL,X0,Options)

SLIDE 31

To use real gradient

Options = optimset('gradobj','on','Display','iter');

X=fminunc({@L,@gradL},X0,Options)

SLIDE 32

To use Hessian

Same as gradient but add:

Options = optimset(Options,'hessian','on');

X = fminunc(@matL,X0,Options)

Or

X = fminunc({@L,@gradL,@hessL},X0,Options)

SLIDE 33

Check your gradients!

Try:

Options = optimset(Options,'DerivativeCheck','on');

X = fminunc(@L,X0,Options)

Try on our family of functions. What happens?