

SLIDE 1

Today’s Discussion

To date:

  • Neural networks: what are they
  • Backpropagation: efficient gradient computation
  • Advanced training: conjugate gradient

Today:

  • CG postscript: scaled conjugate gradients
  • Adaptive architectures
  • My favorite neural network learning environment
  • Some applications

Conjugate gradient algorithm

  • 1. Choose an initial weight vector $w_1$ and let $d_1 = -g_1$.
  • 2. Perform a line minimization along $d_j$, such that: $E(w_j + \alpha^* d_j) \le E(w_j + \eta d_j)$, $\forall \eta$.
  • 3. Let $w_{j+1} = w_j + \alpha^* d_j$.
  • 4. Evaluate $g_{j+1}$.
  • 5. Let $d_{j+1} = -g_{j+1} + \beta_j d_j$, where (Polak-Ribière) $\beta_j = \dfrac{g_{j+1}^T (g_{j+1} - g_j)}{g_j^T g_j}$.
  • 6. Let $j = j + 1$ and go to step 2.
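A minimal sketch of this loop in Python/NumPy, assuming a user-supplied error function E(w) and gradient function grad(w) (hypothetical names), and using scipy's scalar minimizer for the line minimization in step 2:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def conjugate_gradient(E, grad, w, n_iters=100, tol=1e-6):
        """Polak-Ribiere conjugate gradient with an explicit line minimization."""
        g = grad(w)
        d = -g                                    # step 1: d_1 = -g_1
        for _ in range(n_iters):
            # step 2: line minimization along d (find alpha* minimizing E(w + alpha d))
            alpha = minimize_scalar(lambda a: E(w + a * d)).x
            w = w + alpha * d                     # step 3: w_{j+1} = w_j + alpha* d_j
            g_new = grad(w)                       # step 4: evaluate g_{j+1}
            if np.linalg.norm(g_new) < tol:
                break
            beta = g_new @ (g_new - g) / (g @ g)  # step 5: Polak-Ribiere beta_j
            d = -g_new + beta * d                 #         d_{j+1} = -g_{j+1} + beta_j d_j
            g = g_new                             # step 6: j = j + 1, repeat
        return w

On a quadratic error surface with a positive definite Hessian, conjugate gradients converge in at most as many iterations as there are weights; for neural-network error surfaces the loop is simply iterated until the gradient is small.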

Scaled conjugate gradient algorithm

Basic idea: Replace the line minimization

$E(w_j + \alpha^* d_j) \le E(w_j + \eta d_j)$, $\forall \eta$

with:

$\alpha_j = \dfrac{-d_j^T g_j}{d_j^T H d_j}$

Why #!@$ are we doing this? Didn't we want to avoid computation of $H$?

Scaled conjugate gradient algorithm

Well, yes but:

  • Line minimization can be computationally expensive.
  • Don't really have to compute $H$? Huh?

$\alpha_j = \dfrac{-d_j^T g_j}{d_j^T H d_j}$
SLIDE 2

A closer look at $\alpha_j$

Don't have to compute $H$, only $H d_j$.

Theorem: Let $w_0$ = current $W$-dimensional weight vector, $g(w)$ = gradient of $E(w)$ (i.e. $\nabla E$) evaluated at $w$, $H$ = Hessian of $E$ evaluated at $w_0$, and $d$ = an arbitrary $W$-dimensional vector. Then:

$H d = \lim_{\varepsilon \to 0} \dfrac{g(w_0 + \varepsilon d) - g(w_0 - \varepsilon d)}{2\varepsilon}$

Computing $H d_j$

First-order Taylor expansion of $g(w)$ about $w_0$:

$g(w) \approx g(w_0) + H(w - w_0)$

so that:

$\dfrac{g(w_0 + \varepsilon d) - g(w_0 - \varepsilon d)}{2\varepsilon} \approx \dfrac{[g(w_0) + H(\varepsilon d)] - [g(w_0) - H(\varepsilon d)]}{2\varepsilon}$

Computing $H d_j$

$\dfrac{g(w_0 + \varepsilon d) - g(w_0 - \varepsilon d)}{2\varepsilon} \approx \dfrac{2\varepsilon H d}{2\varepsilon} = H d$

So:

$\alpha_j = \dfrac{-d_j^T g_j}{d_j^T H d_j}$

now just requires two gradient evaluations...
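A sketch of this finite-difference Hessian-vector product; grad and the step size eps are placeholders (the slides do not prescribe a value for epsilon):

    import numpy as np

    def hessian_vector_product(grad, w0, d, eps=1e-5):
        """Approximate H d by the central difference of the gradient,
        H d ~ [g(w0 + eps d) - g(w0 - eps d)] / (2 eps),
        using only two gradient evaluations and never forming H."""
        return (grad(w0 + eps * d) - grad(w0 - eps * d)) / (2.0 * eps)

    def step_size(grad, w, d, g, eps=1e-5):
        """alpha_j = -d^T g_j / (d^T H d_j), with H d_j approximated as above."""
        Hd = hessian_vector_product(grad, w, d, eps)
        return -(d @ g) / (d @ Hd)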

New conjugate gradient algorithm

  • 1. Choose an initial weight vector $w_1$ and let $d_1 = -g_1$.
  • 2. Compute $\alpha_j$: $\alpha_j = \dfrac{-d_j^T g_j}{d_j^T H d_j}$.
  • 3. Let $w_{j+1} = w_j + \alpha_j d_j$.
  • 4. Evaluate $g_{j+1}$.
  • 5. Let $d_{j+1} = -g_{j+1} + \beta_j d_j$, where $\beta_j = \dfrac{g_{j+1}^T (g_{j+1} - g_j)}{g_j^T g_j}$.
  • 6. Let $j = j + 1$ and go to step 2.

Any problems?

SLIDE 3

What about $H$?

$\alpha_j = \dfrac{-d_j^T g_j}{d_j^T H d_j}$

might take uphill steps...

Idea:

  • Replace $H$ with $H + \lambda I$
  • So: $\alpha_j = \dfrac{-d_j^T g_j}{d_j^T H d_j + \lambda \|d_j\|^2}$

What the #$@! is this?

Examining $\lambda$

$\alpha_j = \dfrac{-d_j^T g_j}{d_j^T H d_j + \lambda \|d_j\|^2}$

  • What is the meaning of $\lambda$ being very large?
  • What is the meaning of $\lambda$ being very small (i.e. zero)?

Model trust regions

Question: When should we "trust"

$\alpha_j = \dfrac{-d_j^T g_j}{d_j^T H d_j}$ ?

Model trust regions

Question: When should we "trust"

$\alpha_j = \dfrac{-d_j^T g_j}{d_j^T H d_j}$ ?

  • 1. $H$ is positive definite (denominator > 0)
  • 2. Local quadratic assumption is good
SLIDE 4

Near a mountain, not a valley

Look at the denominator of:

$\alpha_j = \dfrac{-d_j^T g_j}{d_j^T H d_j + \lambda \|d_j\|^2}$

Let $\delta = d_j^T H d_j + \lambda \|d_j\|^2$. If $\delta < 0$, increase $\lambda$ to make the denominator positive.

How to increase $\lambda$?

How about:

$\lambda' = 2\left(\lambda - \dfrac{\delta}{\|d_j\|^2}\right)$

so that:

$\delta' = \delta + (\lambda' - \lambda)\|d_j\|^2 = \delta + \left[2\left(\lambda - \dfrac{\delta}{\|d_j\|^2}\right) - \lambda\right]\|d_j\|^2 = \delta - 2\delta + \lambda\|d_j\|^2 = -\delta + \lambda\|d_j\|^2$

New effective denominator value

So, with $\lambda' = 2\left(\lambda - \dfrac{\delta}{\|d_j\|^2}\right)$:

$\delta' = -\delta + \lambda\|d_j\|^2 = -\left(d_j^T H d_j + \lambda\|d_j\|^2\right) + \lambda\|d_j\|^2 = -d_j^T H d_j$

(what does this mean?)

Goin’ up? I’ll show you...

Since the new denominator is $\delta' = -d_j^T H d_j$, the new value of $\alpha_j$ is:

$\alpha_j' = \dfrac{-d_j^T g_j}{-d_j^T H d_j} = \dfrac{d_j^T g_j}{d_j^T H d_j} = -\alpha_j$

SLIDE 5

Model trust regions

Question: When should we "trust"

$\alpha_j = \dfrac{-d_j^T g_j}{d_j^T H d_j}$ ?

  • 1. $H$ is positive definite (denominator > 0)
  • 2. Local quadratic assumption is good

How to test local quadratic assumption?

Check:

$\Delta = \dfrac{E(w_j) - E(w_j + \alpha_j d_j)}{E(w_j) - E_Q(w_j + \alpha_j d_j)}$

What's $E_Q$?

$E_Q(w) = E(w_0) + (w - w_0)^T b + \dfrac{1}{2}(w - w_0)^T H (w - w_0)$

So:

$E_Q(w_j + \alpha_j d_j) = E(w_j) + \alpha_j d_j^T g_j + \dfrac{1}{2}\alpha_j^2 d_j^T H d_j$

What does $\Delta$ tell us?

Local quadratic test

Adjustment of the trust region, based on

$\Delta = \dfrac{E(w_j) - E(w_j + \alpha_j d_j)}{E(w_j) - E_Q(w_j + \alpha_j d_j)}$:

  • If $\Delta > 0.75$, then decrease $\lambda$ (e.g. $\lambda = \lambda / 2$)
  • If $\Delta < 0.25$, then increase $\lambda$ (e.g. $\lambda = 4\lambda$)
  • Otherwise, leave $\lambda$ unchanged
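A sketch of this test and the corresponding lambda update (the 0.25/0.75 thresholds and the halve/quadruple factors are the ones suggested on the slide; dHd stands for the already-computed $d_j^T H d_j$):

    def quadratic_trust_update(E, w, g, d, alpha, dHd, lam):
        """Compare the actual error reduction with the reduction predicted by the
        local quadratic model E_Q, then adjust the trust-region parameter lambda."""
        E_w = E(w)
        E_new = E(w + alpha * d)
        E_quad = E_w + alpha * (d @ g) + 0.5 * alpha**2 * dHd   # E_Q(w + alpha d)
        Delta = (E_w - E_new) / (E_w - E_quad)
        if Delta > 0.75:        # quadratic model is trustworthy: shrink lambda
            lam = lam / 2.0
        elif Delta < 0.25:      # quadratic model is poor: grow lambda
            lam = 4.0 * lam
        return Delta, lam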

Scaled conjugate gradient algorithm (computing $\alpha_j$ and $\lambda$)

  • 1. Compute $\delta = d_j^T H d_j + \lambda \|d_j\|^2$.
  • 2. If $\delta < 0$, set $\lambda = 2\left(\lambda - \delta / \|d_j\|^2\right)$.
  • 3. Compute $\alpha_j = -d_j^T g_j / \left(d_j^T H d_j + \lambda \|d_j\|^2\right)$.
  • 4. Compute $\Delta = \dfrac{E(w_j) - E(w_j + \alpha_j d_j)}{E(w_j) - E_Q(w_j + \alpha_j d_j)}$.
  • 5. If $\Delta > 0.75$, set $\lambda = \lambda / 2$; else if $\Delta < 0.25$, set $\lambda = 4\lambda$.

SLIDE 6

Scaled conjugate gradient algorithm

  • 1. Choose an initial weight vector $w_1$ and let $d_1 = -g_1$.
  • 2. Compute $\alpha_j$, $\lambda$: $\alpha_j = \dfrac{-d_j^T g_j}{d_j^T H d_j + \lambda \|d_j\|^2}$.
  • 3. Let $w_{j+1} = w_j + \alpha_j d_j$.
  • 4. Evaluate $g_{j+1}$.
  • 5. Let $d_{j+1} = -g_{j+1} + \beta_j d_j$, where $\beta_j = \dfrac{g_{j+1}^T (g_{j+1} - g_j)}{g_j^T g_j}$.
  • 6. Let $j = j + 1$ and go to step 2.
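Putting the preceding pieces together, here is one possible end-to-end sketch of this loop (finite-difference $H d_j$, the $\delta < 0$ correction, and the $\Delta$-based lambda update); the initial lam, eps, and convergence test are assumptions, not values given on the slides:

    import numpy as np

    def scaled_conjugate_gradient(E, grad, w, n_iters=200, lam=1e-4, eps=1e-5, tol=1e-6):
        g = grad(w)
        d = -g                                                       # d_1 = -g_1
        for _ in range(n_iters):
            Hd = (grad(w + eps * d) - grad(w - eps * d)) / (2 * eps) # H d_j via two gradients
            dHd = d @ Hd
            dd = d @ d
            delta = dHd + lam * dd
            if delta <= 0:                          # "near a mountain": force denominator > 0
                lam = 2.0 * (lam - delta / dd)
                delta = dHd + lam * dd              # now -d^T H d_j
            alpha = -(d @ g) / delta                # step size, no line search
            w_new = w + alpha * d
            # local quadratic test and trust-region adjustment
            E_w, E_new = E(w), E(w_new)
            E_quad = E_w + alpha * (d @ g) + 0.5 * alpha**2 * dHd
            Delta = (E_w - E_new) / (E_w - E_quad)
            if Delta > 0.75:
                lam /= 2.0
            elif Delta < 0.25:
                lam *= 4.0
            w = w_new
            g_new = grad(w)
            if np.linalg.norm(g_new) < tol:
                break
            beta = g_new @ (g_new - g) / (g @ g)    # Polak-Ribiere
            d = -g_new + beta * d
            g = g_new
        return w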

Today’s Discussion

To date:

  • Neural networks: what are they
  • Backpropagation: efficient gradient computation
  • Advanced training: conjugate gradient

Today:

  • CG postscript: scaled conjugate gradients
  • Adaptive architectures
  • My favorite neural network learning environment
  • Some applications

Adaptive architectures

Standard learning:

  • Select neural network architecture
  • Train neural network
  • If failure, go back to first step

Better approach:

  • Adapt neural network architecture as function of training

Adaptive architectures

Standard learning vs. adaptive approach:

[Figure: standard learning vs. adaptive approach, shown over the course of training]

SLIDE 7

Adaptive architectures

Problem: How do we do this? Two main approaches:

  • Pruning (destructive algorithms)
  • Growing (constructive algorithms)

Pruning algorithms

Basic idea:

  • Start with really “big” network
  • Eliminate “unimportant” weights/nodes
  • Retrain neural network

Advantages? Disadvantages? Problems?

Pruning algorithms

Basic idea:

  • Start with really “big” network
  • Eliminate “unimportant” weights/nodes
  • Retrain neural network

Advantages? (smaller final architectures) Disadvantages? (training cost of large network, retraining) Problems? (what is “unimportant?”)

Weight elimination schemes

Idea: eliminate weights based on "saliency."

Definition: saliency $S_i$ = relative importance of weight $\omega_i$

Any suggestions?

SLIDE 8

Saliency

First guess: $S_i = |\omega_i|$. Will this work?

Why won't this measure of saliency work?

[Figure: small example network (weights 2, 1, 1, -1, 0.01, 0.01) showing that weight magnitude alone is a poor measure of saliency]

A better idea for saliency

Try to find the relationship between $\delta E$ and $\delta w$. How can we do this?

  • Brute force: $\delta E_i = E(w) - E(w + \delta w_i)$, where $\delta w_i = [0, \ldots, 0, -\omega_i, 0, \ldots, 0]$ (problems?)

More on saliency

Use ol' reliable: the 2nd-order Taylor approximation

$E(w) \approx E(w_0) + (w - w_0)^T \nabla E(w_0) + \dfrac{1}{2}(w - w_0)^T H (w - w_0)$

Now, with $\delta w = w_1 - w_0$ (what are $w_1$ and $w_0$?):

$\delta E = E(w_1) - E(w_0) = \delta w^T \nabla E(w_0) + \dfrac{1}{2}\delta w^T H \delta w$

Can we simplify this?

SLIDE 9

Optimal Brain Damage

  • 1. Idea: assume the Hessian is diagonal, so that $\delta E = \dfrac{1}{2}\delta w^T H \delta w = \dfrac{1}{2}\sum_i H_{ii}\,\delta\omega_i^2$
  • 2. Resulting saliency: $S_i = \dfrac{H_{ii}\,\omega_i^2}{2}$
  • 3. Eliminate weights with smallest saliency
  • 4. Retrain remaining weights
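A sketch of these steps, assuming the Hessian diagonal H_diag is available (for instance via the finite-difference H d trick applied to unit vectors); retraining is left to the caller, and the pruning fraction is an illustrative parameter:

    import numpy as np

    def obd_prune(w, H_diag, frac=0.1):
        """Optimal Brain Damage: saliency S_i = H_ii w_i^2 / 2 under the
        diagonal-Hessian assumption; zero the fraction `frac` of weights with
        the smallest saliency (retrain afterwards)."""
        saliency = H_diag * w**2 / 2.0
        n_prune = int(frac * len(w))
        prune_idx = np.argsort(saliency)[:n_prune]   # smallest-saliency weights
        w_pruned = w.copy()
        w_pruned[prune_idx] = 0.0
        return w_pruned, prune_idx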

Optimal Brain Surgery

  • Smarter idea: don't assume the Hessian is diagonal
  • Eliminate need for retraining

Now, assume you want to remove weight $\omega_i$. We want to minimize

$\delta E = \dfrac{1}{2}\delta w^T H \delta w$

subject to the constraint $\delta\omega_i = -\omega_i$ (why?)

Optimal Brain Surgery

Use Lagrange multipliers:

  • We can minimize $f(x)$ subject to the constraint $g(x) = 0$ by minimizing $L = f(x) + \lambda g(x)$
  • $\lambda$ = Lagrange multiplier

For our case:

$f(x) = \delta E = \dfrac{1}{2}\delta w^T H \delta w$ and $g(x) = \delta\omega_i + \omega_i$

Optimal Brain Surgery

Minimize:

$L = \dfrac{1}{2}\delta w^T H \delta w + \lambda(\delta\omega_i + \omega_i)$

... Solution:

$\delta w = -\dfrac{\omega_i}{[H^{-1}]_{ii}}\, H^{-1} u_i$, with $\delta E_i = \dfrac{1}{2}\dfrac{\omega_i^2}{[H^{-1}]_{ii}}$

(what's the problem?)
SLIDE 10

Optimal Brain Surgery

  • 1. Evaluate the inverse Hessian $H^{-1}$.
  • 2. Evaluate: $\delta E_i = \dfrac{1}{2}\dfrac{\omega_i^2}{[H^{-1}]_{ii}}$
  • 3. Eliminate the weight $\omega_i$ for which $\delta E_i < \delta E_j$, $\forall j \ne i$, using $\delta w = -\dfrac{\omega_i}{[H^{-1}]_{ii}}\, H^{-1} u_i$.
  • 4. Update all weights (no retraining)
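A sketch of one pruning pass, assuming the full Hessian H is available and invertible (in practice an approximation of the inverse Hessian is maintained incrementally; that machinery is omitted here):

    import numpy as np

    def obs_prune_one(w, H):
        """Optimal Brain Surgery: remove the weight with the smallest
        delta_E_i = w_i^2 / (2 [H^-1]_ii) and adjust all remaining weights by
        delta_w = -(w_i / [H^-1]_ii) H^-1 u_i, so no retraining is needed."""
        H_inv = np.linalg.inv(H)
        dE = w**2 / (2.0 * np.diag(H_inv))
        i = int(np.argmin(dE))                  # weight whose removal hurts least
        u_i = np.zeros_like(w)
        u_i[i] = 1.0
        dw = -(w[i] / H_inv[i, i]) * (H_inv @ u_i)
        return w + dw, i                        # (w + dw)[i] == 0 by construction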

Node elimination scheme

Idea: Node pruning: need saliency of a node, not a weight.

Define:

$z_j = \gamma\!\left(\alpha_j \sum_i \omega_{ij} z_i\right)$

(output of unit $j$ with the addition of $\alpha_j$)

Then:

$s_j = E(\alpha_j = 1) - E(\alpha_j = 0) \approx \left.\dfrac{\partial E}{\partial \alpha_j}\right|_{\alpha_j = 1}$

Pruning algorithms: key issues

  • Large network to small network
  • Need definition of saliency
  • May need retraining step

Big problem: lots of wasted training effort

Growing algorithms

Basic idea:

  • Start with really small network
  • Add hidden units as required

Advantages? Disadvantages? Problems?

SLIDE 11

Growing algorithms

Basic idea:

  • Start with really small network
  • Add hidden units as required

Advantages? (reduced training cost, optimized networks) Disadvantages? (?) Problems? (arrangement of added weights/nodes)

Cascade growing: initial network

Cascade growing: first hidden unit training

Cascade growing: second hidden unit frozen
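A sketch of the connectivity this growth scheme produces, written as a forward pass (the weight layout, activation choice, and names here are illustrative; how each new unit is trained before being frozen is omitted):

    import numpy as np

    def cascade_forward(x, hidden_weights, output_weights, act=np.tanh):
        """Forward pass through a cascade network.

        hidden_weights : list of weight vectors; the k-th vector has length
                         len(x) + k + 1 (inputs, the k earlier hidden units, bias)
        output_weights : array of shape (n_outputs, len(x) + n_hidden + 1)
        """
        signals = list(x)                              # everything visible so far
        for w_h in hidden_weights:
            z = act(np.dot(w_h, signals + [1.0]))      # new unit sees inputs + all earlier units
            signals.append(z)                          # ...then becomes visible to later units
        return output_weights @ np.array(signals + [1.0])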

SLIDE 12

Cascade growing: alternative visualization

Cascade neural networks

Do you ever need deeply nested structure?

SLIDE 13

Two-spiral problem: best fixed architecture

[Figure: input layer (2 inputs), first hidden layer (5 hidden units), second hidden layer (5 hidden units), third hidden layer (5 hidden units)]

Two-spiral problem: cascade architecture

Today’s Discussion

To date:

  • Neural networks: what are they
  • Backpropagation: efficient gradient computation
  • Advanced training: conjugate gradient

Today:

  • CG postscript: scaled conjugate gradients
  • Adaptive architectures
  • My favorite neural network learning environment
  • Some applications
SLIDE 14

NN environment that rocks...

Two problems with traditional neural networks:

  • Fixed architecture
      • Difficult to guess “appropriate” architecture
      • Functional complexity requirements can vary widely
  • Slow learning algorithms (e.g. backprop, quickprop)

My neural network approach:

  • Flexible architecture
      • Cascade neural networks
      • Variable activation functions
  • Fast learning algorithm (e.g. NDEKF)

Cascade neural networks with node-decoupled extended Kalman filtering (NDEKF)

Types of problems investigated:

  • Continuous function approximation
  • Dynamic system modeling

Cascade learning and NDEKF combine to result in better error convergence.

A sneak peek at results

[Figure: error-convergence performance of fixed architecture/quickprop, cascade learning/quickprop, fixed architecture/NDEKF, and cascade learning/NDEKF]

Additional flexibility: variable activations

Cascade neural networks already offer great flexibility... However, why restrict candidate activation functions?

  • Sigmoidal activation functions may not offer best results.
  • Sinusoidals and/or others may be more appropriate:

$\dfrac{1 - e^{-x}}{1 + e^{-x}}$,  $\exp(-x^2/2)$,  $\sin(x)$,  $\sin(2x)$,  $J_0(x)$,  $J_1(x)$
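A sketch of what such a candidate pool might look like (scipy provides the Bessel functions J0 and J1; the exact pool and the selection criterion used in the environment described here are assumptions, and train_candidate is a hypothetical helper):

    import numpy as np
    from scipy.special import j0, j1

    # Candidate activation functions for a newly added hidden unit.
    CANDIDATES = {
        "sigmoid":  lambda x: (1.0 - np.exp(-x)) / (1.0 + np.exp(-x)),
        "gaussian": lambda x: np.exp(-x**2 / 2.0),
        "sin":      np.sin,
        "sin_2x":   lambda x: np.sin(2.0 * x),
        "bessel_0": j0,
        "bessel_1": j1,
    }

    def pick_activation(train_candidate, candidates=CANDIDATES):
        """Train one candidate unit per activation (train_candidate is a
        placeholder returning that unit's residual error) and keep the best."""
        errors = {name: train_candidate(act) for name, act in candidates.items()}
        return min(errors, key=errors.get)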

SLIDE 15

Additional flexibility: variable activations

For continuous mapping problems:

  • Variable networks converge to better minima.
  • Sinusoidal networks — about same as variable networks.

Better learning: extended Kalman filtering

View neural network training problem as system identification problem.

  • Let weights of neural network represent state of nonlinear

dynamic system.

  • Let neural network be that nonlinear system.

Extended Kalman filter training:

  • Advantage: Explicitly accounts for pairwise

interdependence of weights with conditional error covariance matrix.

  • Disadvantage: $O(W^2)$ computational complexity, where $W$ is the number of weights in the network.

Decoupled extended Kalman filtering

Key insight:

  • Some weights are more interdependent than others.
  • Group weights into groups.
  • Ignore interdependence between groups of weights (block

diagonalize conditional error covariance matrix). Even better idea: Group weights by node!

Node-decoupled extended Kalman filtering

Key insight: Decouple (group) weights by node: Natural formulation for cascade learning

  • One weight group for current hidden unit
  • One additional weight group for each output unit
  • Matrix operations reduce to vector operations.
  • Computational complexity reduces to $O\!\left(\sum_i W_i^2\right)$.

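A sketch of a generic node-decoupled EKF weight update along these lines (it follows the usual decoupled-EKF recursion with per-group covariance blocks; the learning-rate and noise handling of the actual environment may differ, and all names here are illustrative):

    import numpy as np

    def ndekf_step(weights, P, psi, err, q=1e-6):
        """One node-decoupled EKF update.

        weights : list of per-node weight vectors (one group per node)
        P       : list of per-group error-covariance matrices, (n_i x n_i)
        psi     : list of per-group Jacobians d(outputs)/d(w_i), shape (n_i, m)
        err     : output error vector (targets minus outputs), length m
        q       : small process noise keeping each P_i positive definite

        Only one m x m matrix is inverted; everything else is per-group, which
        is what makes the node-decoupled form cheap when m is small.
        """
        m = err.shape[0]
        A = np.eye(m)
        for P_i, psi_i in zip(P, psi):
            A += psi_i.T @ P_i @ psi_i
        A_inv = np.linalg.inv(A)                       # the only m x m inversion
        new_weights, new_P = [], []
        for w_i, P_i, psi_i in zip(weights, P, psi):
            K_i = P_i @ psi_i @ A_inv                  # per-group Kalman gain
            new_weights.append(w_i + K_i @ err)
            new_P.append(P_i - K_i @ psi_i.T @ P_i + q * np.eye(len(w_i)))
        return new_weights, new_P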
SLIDE 16

Computational complexity

NDEKF requires inversion of an $m \times m$ matrix ($m$ = number of outputs). Cascade learning with NDEKF typically requires fewer than 10 epochs/hidden unit.

  • Several orders of magnitude less than backprop or quickprop approaches.
  • Computational complexity similar to fixed-architecture networks trained with NDEKF.

Computational complexity

Ratio of computational cost between a cascade/NDEKF epoch and an equivalent fixed-architecture/backprop epoch (for few outputs):

  • Example: for 400 inputs and 20 hidden units, the ratio is less than 100.
  • Example: for 20 or fewer inputs, the ratio is less than 10.

Experimental studies

Four learning approaches:

  Symbol   Explanation
  Fq       fixed-architecture training with quickprop
  Cq       cascade-network training with quickprop
  Fk       fixed-architecture training with NDEKF
  Ck       cascade-network training with NDEKF

Experimental studies

Key questions:

  • Do we improve learning using NDEKF by going from

fixed-architecture networks to cascade-type learning?

  • Do we improve cascade learning by switching from

quickprop (simple training) to NDEKF?

  • Are any of the more advanced methods (Cq, Fk, Ck) an improvement over the baseline Fq (fixed-architecture/quickprop) training method?

SLIDE 17

Five learning problems

Problem (A): smooth, continuous function approximation

$f_1(x, y, z) = z\sin(\pi y) + x$

Problem (B): nonsmooth, continuous function approximation

$f_2(x, y, z) = z^2\cos(\pi x y) - y^2$

[Plot: $f(x)$ vs. $x$]

Five learning problems

Problem (C): deterministic dynamic system

$u(k+1) = f[u(k),\, u(k-1),\, u(k-2),\, x(k),\, x(k-1)]$

$f[x_1, x_2, x_3, x_4, x_5] = \dfrac{x_1 x_2 x_3 x_5 (x_3 - 1) + x_4}{1 + x_3^2 + x_2^2}$

Problems (D) & (E): chaotic Mackey-Glass dynamic system, predicted at $(t+6)$ and $(t+84)$

[Plot: $\dot{x}(t)$ vs. $x(t)$]
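For concreteness, a sketch of how training data for problem (C) can be generated from the recursion above (the random excitation signal x(k) is an assumption; the slides do not specify the input used):

    import numpy as np

    def f_c(x1, x2, x3, x4, x5):
        """Nonlinear map defining problem (C)."""
        return (x1 * x2 * x3 * x5 * (x3 - 1.0) + x4) / (1.0 + x3**2 + x2**2)

    def simulate_problem_c(n_steps, seed=0):
        """u(k+1) = f[u(k), u(k-1), u(k-2), x(k), x(k-1)], driven by a random input."""
        rng = np.random.default_rng(seed)
        x = rng.uniform(-1.0, 1.0, size=n_steps)       # assumed excitation signal
        u = np.zeros(n_steps)
        for k in range(2, n_steps - 1):
            u[k + 1] = f_c(u[k], u[k - 1], u[k - 2], x[k], x[k - 1])
        return x, u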

Learning results (avg. RMS error)

  Problem   Ck            Fk             Cq             Fq
  (A)       42.1 (4.2)    127.1 (37.3)   94.5 (6.2)     N/A
  (B)       7.4 (2.0)     12.4 (3.2)     14.5 (4.0)     65.0 (18.2)
  (C)       15.6 (1.5)    20.7 (4.8)     29.9 (2.0)     N/A
  (D)       4.6 (0.6)     10.2 (4.0)     9.4 (2.7)      16.7 (2.2)
  (E)       42.0 (5.9)    60.5 (3.1)     72.6 (16.3)    90.3 (8.3)

Learning results

[Bar chart: % difference in RMS error for problems (A) through (E)]

  • % difference in RMS error between cascade/NDEKF and fixed-architecture/NDEKF
  • % difference in RMS error between cascade/NDEKF and cascade/quickprop
  • % difference in RMS error between cascade/NDEKF and fixed-architecture/quickprop

SLIDE 18

Why is Ck better than Fk?

“NDEKF at times requires a small amount of redundancy in network in terms of total number of nodes in order to avoid poor local minima...” — [Puskorius & Feldkamp, 1991]

[Plot: RMS error on problem (A) for Cq, Fk (sn), Ck, and Fk (sg), with the bad local minima indicated]

Why is Ck better than Cq?

As hidden units are added in cascade learning, NDEKF is better equipped to handle increasingly correlated weights to new hidden units.

[Plots: error vs. number of hidden units for Cq and Ck on problem (C)]

Cascade/NDEKF advantages/disadvantages

  • Cascade learning and NDEKF complement each other well.
  • Cascade learning minimizes the potentially detrimental effect of node-decoupling.
  • Cascade learning minimizes the problem of poor local minima in NDEKF.
  • NDEKF better handles the increased correlation of weights as the number of hidden units increases in cascade learning.
  • NDEKF requires no learning parameter tuning.
  • Cascade/NDEKF converges efficiently to better local minima than either cascade or NDEKF by themselves.
  • Disadvantage: computationally efficient only when the number of outputs is small.

Today’s Discussion

To date:

  • Neural networks: what are they
  • Backpropagation: efficient gradient computation
  • Advanced training: conjugate gradient

Today:

  • CG postscript: scaled conjugate gradients
  • Adaptive architectures
  • My favorite neural network learning environment
  • Some applications