

SLIDE 1: Title

Rapid Stochastic Gradient Descent
Accelerating Machine Learning

Nicol N. Schraudolph
Statistical Machine Learning Program, NICTA (www.nicta.com.au)
The imagination driving Australia's ICT future.

SLIDE 2: Overview

1. Why Stochastic Gradient?
2. Stochastic Meta-Descent (SMD)
     • Derivation and Algorithm
     • Properties and Benchmark Results
     • Applications and Ongoing Work
3. Summary and Outlook

SLIDE 3: The Information Glut

The flood of information caused by
  • plentiful, affordable sensors (such as webcams)
  • ever-increasing networking of these sensors
overwhelms our processing ability in, e.g.,
  • science: the pulsar survey at Arecibo produces 1 TB/day
  • business: the Dell website serves over 100 page requests/sec
  • security: London has over 500,000 security cameras

We need intelligent, adaptive filters to cope!

SLIDE 4: A Challenge for ML

Coping with the info glut requires ML algorithms for
  • large, complex, nonlinear models: millions of degrees of freedom
  • large volumes of low-quality data: noisy, correlated, non-stationary, with outliers
  • efficient real-time, online adaptation: no fixed training set, life-long learning

Current ML techniques have difficulty with this.

SLIDE 5: Online Learning Paradigm

Classical optimization: an iterative optimizer repeatedly evaluates the objective function over a fixed training data set (nested loops: optimizer iterations around passes over the data).

Online learning: an online optimizer consumes a training data stream, updating once per example.

(aka adaptive filtering, stochastic approximation, ...)

SLIDE 6: Stochastic Approximation

Classical formulation of the optimization problem:

$$\theta^* = \arg\min_\theta \, \mathbb{E}_x[J(\theta, x)] \;\approx\; \arg\min_\theta \, \frac{1}{|X|} \sum_{x_i \in X} J(\theta, x_i)$$

This is inefficient for large data sets X, and inappropriate for never-ending, potentially non-stationary data streams.

⇒ must resort to stochastic approximation, optimizing one example at a time:

$$\theta_{t+1} \approx \arg\min_\theta \, J(\theta, x_t) \qquad (t = 0, 1, 2, \ldots)$$
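
In practice the per-example minimization is realized as a single gradient step per example. A minimal sketch of such a stochastic gradient loop in Python; the quadratic per-example loss and the synthetic data stream are illustrative assumptions, not from the slides:

```python
import numpy as np

def grad_J(theta, x):
    # Illustrative per-example loss J(theta, x) = 0.5 * ||theta - x||^2,
    # whose gradient is simply (theta - x); any differentiable loss works.
    return theta - x

rng = np.random.default_rng(0)
theta = np.zeros(3)
eta = 0.1  # fixed scalar gain; adapting it is the subject of this talk

for t in range(1000):
    x_t = rng.normal(loc=1.0, size=3)         # one example from the data stream
    theta = theta - eta * grad_J(theta, x_t)  # stochastic gradient step

print(theta)  # drifts toward the stream mean (1, 1, 1)
```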

SLIDE 7: The Key Problem

[Figure: optimization algorithms plotted by cost per iteration, from O(1) through O(n) and O(n²) to O(n³), against convergence speed. Evolutionary algorithms sit at the cheap, slowly converging end; gradient descent and conjugate gradient follow; quasi-Newton, Levenberg-Marquardt, and the Kalman filter occupy the expensive, fast-converging end. Only the cheap end is online and scalable; the goal is to accelerate convergence there.]

SLIDE 8: The Key Problem

Stochastic approximation breaks many optimizers:
  • conjugate directions break down due to noise
  • line minimizations (CG, quasi-Newton) become inaccurate
  • Newton, Levenberg-Marquardt, Kalman filter: too expensive for large-scale problems

This only leaves:
  • evolutionary algorithms: very inefficient (they don't use the gradient)
  • simple gradient descent: can be slow to converge

SLIDE 9: Gain Vector Adaptation

Given the stochastic gradient $g_t := \partial_\theta J(\theta_t, x_t)$, adapt θ by gradient descent with a gain vector η, applied element-wise (Hadamard product):

$$\theta_{t+1} = \theta_t - \eta_t \cdot g_t$$

Key idea: simultaneously adapt η by exponentiated gradient descent with a scalar meta-gain μ (a free parameter):

$$\ln \eta_t = \ln \eta_{t-1} - \mu \, \partial_{\ln\eta} J(\theta_t, x_t)$$

that is, with $v_t := \partial_{\ln\eta} \theta_t$,

$$\eta_t = \eta_{t-1} \cdot \exp(-\mu \, \partial_\theta J(\theta_t, x_t) \cdot \partial_{\ln\eta} \theta_t) \;\approx\; \eta_{t-1} \cdot \max(\tfrac{1}{2},\, 1 - \mu \, g_t \cdot v_t)$$
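
A minimal sketch of this gain-vector adaptation in Python, using the single-step model v_{t+1} = −η_t · g_t derived on the next slide; the quadratic per-example loss, the synthetic data stream, and all constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)       # parameters theta_t
eta = np.full(3, 0.05)    # per-parameter gain vector eta_t
v = np.zeros(3)           # v_t ~ d(theta_t)/d(ln eta)
mu = 0.1                  # scalar meta-gain (free parameter)

for t in range(2000):
    x_t = rng.normal(loc=1.0, size=3)              # one example from the stream
    g = theta - x_t                                # gradient of 0.5 * ||theta - x_t||^2
    eta = eta * np.maximum(0.5, 1.0 - mu * g * v)  # safeguarded gain update
    theta = theta - eta * g                        # element-wise gradient step
    v = -eta * g                                   # single-step model (slide 10)
```

The max(1/2, ·) safeguard linearizes the exponential while preventing any gain from shrinking by more than half in one step.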

SLIDE 10: Single-Step Model

Conventionally, only the dependence of θ on the immediately preceding gain is modelled (recall that $\theta_{t+1} = \theta_t - \eta_t \cdot g_t$):

$$v_{t+1} := \partial_{\ln \eta_t} \theta_{t+1} = -\,\eta_t \cdot g_t$$

Substituting $v_t = -\eta_{t-1} \cdot g_{t-1}$ into the gain update from slide 9 gives

$$\eta_t = \eta_{t-1} \cdot \max(\tfrac{1}{2},\, 1 + \mu \, \eta_{t-1} \cdot g_{t-1} \cdot g_t)$$

⇒ the adaptation of η is driven by the autocorrelation of g.

SLIDE 11: SMD's Multi-Step Model

To capture the long-term dependence of θ on η, define, with a decay factor 0 ≤ λ ≤ 1 (a free parameter):

$$v_{t+1} := \sum_{i=0}^{t} \lambda^i \, \frac{\partial \theta_{t+1}}{\partial \ln \eta_{t-i}}$$

[Figure: parameter and gain trajectories over time, contrasting the single-step dependence of θ on η with the exponentially decaying multi-step dependence.]

SLIDE 12: SMD's v-update

$$v_{t+1} := \sum_{i=0}^{t} \lambda^i \frac{\partial \theta_{t+1}}{\partial \ln \eta_{t-i}}
= \sum_{i=0}^{t} \lambda^i \frac{\partial \theta_t}{\partial \ln \eta_{t-i}}
- \sum_{i=0}^{t} \lambda^i \frac{\partial (\eta_t \cdot g_t)}{\partial \ln \eta_{t-i}}$$

$$= \lambda v_t - \sum_{i=0}^{t} \lambda^i \frac{\partial \eta_t}{\partial \ln \eta_{t-i}} \cdot g_t - \sum_{i=0}^{t} \lambda^i \, \eta_t \cdot \frac{\partial g_t}{\partial \ln \eta_{t-i}}$$

Keeping only the direct dependence of η_t on its own log-gain ($\partial \eta_t / \partial \ln \eta_t = \eta_t$), and using $\partial g_t / \partial \ln \eta_{t-i} = H_t \, \partial \theta_t / \partial \ln \eta_{t-i}$:

$$\approx \lambda v_t - \eta_t \cdot \left( g_t + \sum_{i=0}^{t} \lambda^i H_t \, \frac{\partial \theta_t}{\partial \ln \eta_{t-i}} \right) = \lambda v_t - \eta_t \cdot (g_t + \lambda H_t v_t)$$

Thus:
  • we obtain a simple iterative update for v
  • it smooths correctly over correlated input signals
  • it involves an implicit Hessian-vector (Hv) product (see the sketch after this list), which
      • can be computed as efficiently as 2-3 gradient evaluations
      • can be done automatically via algorithmic differentiation
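
A minimal sketch of one full SMD step in Python. Here the Hv product is approximated by the finite difference of the gradient at θ + εv and at θ, which costs about one extra gradient evaluation; the quadratic per-example loss and all constants are illustrative assumptions, not the talk's code:

```python
import numpy as np

def grad_J(theta, x):
    # Illustrative per-example gradient for J(theta, x) = 0.5 * ||theta - x||^2.
    return theta - x

def smd_step(theta, eta, v, x, mu=0.05, lam=0.99, eps=1e-6):
    g = grad_J(theta, x)
    # Gain adaptation (slide 9): eta_t = eta_{t-1} * max(1/2, 1 - mu * g_t * v_t)
    eta = eta * np.maximum(0.5, 1.0 - mu * g * v)
    # Parameter update: theta_{t+1} = theta_t - eta_t * g_t
    theta_new = theta - eta * g
    # Implicit Hessian-vector product H_t v_t, approximated by finite differences:
    # Hv ~ (grad(theta + eps*v) - grad(theta)) / eps
    Hv = (grad_J(theta + eps * v, x) - g) / eps
    # SMD's v-update: v_{t+1} = lam * v_t - eta_t * (g_t + lam * H_t v_t)
    v = lam * v - eta * (g + lam * Hv)
    return theta_new, eta, v

# Toy usage on a synthetic stream whose optimum is theta = (1, 1, 1):
rng = np.random.default_rng(0)
theta, eta, v = np.zeros(3), np.full(3, 0.05), np.zeros(3)
for t in range(2000):
    theta, eta, v = smd_step(theta, eta, v, rng.normal(loc=1.0, size=3))
```

In a production implementation the Hv product would instead be computed exactly via algorithmic differentiation (forward-over-reverse mode), as the slide notes.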

SLIDE 13: Fixpoint of v

The fixpoint of $v_{t+1} = \lambda v_t - \eta_t \cdot (g_t + \lambda H_t v_t)$ is a Levenberg-Marquardt style gradient step:

$$v \;\to\; -\left[\lambda H + (1 - \lambda)\,\mathrm{diag}(\eta)^{-1}\right]^{-1} g$$

  • v is too noisy to use directly; SMD achieves stability by means of the double integration v → η → θ
  • v · g is well-behaved (self-normalizing property)
  • SMD uses a Gauss-Newton approximation of H
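
For completeness, the fixpoint follows by setting $v_{t+1} = v_t = v$ and treating η, H, and g as fixed, writing the element-wise product η · x as $\mathrm{diag}(\eta)\,x$ (a short verification sketch, not on the slide):

$$
\begin{aligned}
v &= \lambda v - \mathrm{diag}(\eta)\,(g + \lambda H v) \\
(1-\lambda)\,v + \lambda\,\mathrm{diag}(\eta)\,H v &= -\,\mathrm{diag}(\eta)\,g \\
\left[(1-\lambda)\,\mathrm{diag}(\eta)^{-1} + \lambda H\right] v &= -\,g \\
v &= -\left[\lambda H + (1-\lambda)\,\mathrm{diag}(\eta)^{-1}\right]^{-1} g
\end{aligned}
$$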

SLIDE 14: Four Regions Benchmark

Compare simple stochastic gradient descent (SGD), conventional gain vector adaptation (ALAP), stochastic meta-descent (SMD), and a global extended Kalman filter (GEKF).

[Figure: the four regions classification task in the (x, y) plane.]

SLIDE 15: Benchmark: Convergence

[Figure: loss (0.0 to 1.4) vs. number of training patterns (0k to 25k) for SMD, SGD, ALAP, and GEKF.]

SLIDE 16: Computational Cost

Algorithm   storage/weight   flops/update   CPU ms/pattern
SGD                1                6              0.5
SMD                3               18              1.0
ALAP               4               18              1.0
GEKF             >90            >1500             40

SLIDE 17: Benchmark: CPU Usage

[Figure: loss (0.0 to 1.4) vs. CPU seconds (10 to 50) for SMD, SGD, ALAP, and GEKF.]

SLIDE 18: Autocorrelated Data

[Figure: three panels of error E vs. training patterns for inputs drawn i.i.d. uniform, from a Sobol sequence, and from Brownian motion; the curves compare SMD, ELK1, ALAP, s-ALAP, vario-eta, and momentum.]

SLIDE 19: Comparison to CG

Conjugate gradient, deterministic (1000 pts):         overfits
Conjugate gradient, stochastic (1000 pts/iteration):  diverges
SMD, stochastic (5 pts/iteration):                    converges

SLIDE 20: Application: Turbulent Flow

(PhD thesis of M. Milano, Institute of Computational Science, ETH Zürich)

[Figure: the original flow (75,000 d.o.f.) compared with reconstructions by linear PCA (160 principal components) and by a neural network (160 nonlinear principal components).]

SLIDE 21: Turbulent Flow Model

A very high-dimensional optimization problem:
  • 15 neural networks, each with about 180,000 parameters
  • the generic model has over 20 million parameters!

Here SMD:
  • outperformed the Matlab toolbox
  • was able to train the generic model

Learning Curves

[Figure: reconstruction error (1e-02 to 1e+01) vs. iteration (×10³, 0.00 to 2.00) for the bold driver method and SMD.]

SLIDE 22: Application: Hand Tracking

(PhD thesis of M. Bray, Computer Vision Lab, ETH Zürich)

  • detailed hand model (10k vertices, 26 d.o.f.)
  • randomly sample a few points on the model surface
  • project them to the image
  • compare with the camera image at these points
  • SMD uses the resulting stochastic gradient to adjust the model

SLIDE 23: Hand Tracking with SMD

State of the art: annealed particle filter (114 sec/frame).
Our algorithm: stochastic meta-descent (3 sec/frame).

SLIDE 24: Hand Tracking: Results

Through stochastic sampling, SMD achieves:
  • a 40× speedup over the state of the art (3 vs. 114 s/frame)
  • better tracking: the sampling noise helps escape local minima
  • robustness w.r.t. clutter, shadows, occlusions, ...

Ongoing work at NICTA:
  • use multiple ordinary (even cheap) video cameras
  • simultaneous real-time tracking of hands, face & body

SLIDE 25: SMD for Online SVM

Online SVM, a.k.a. NORMA (Kivinen, Smola, Williamson 2004):
  • an online kernel method (a sketch follows below)
  • stochastic gradient descent in the expansion coefficients
  • employs a scalar gain η

Application of SMD:
  • v is a function in the RKHS
  • ⟨g, v⟩ can be maintained incrementally in O(n)
  • NIPS'05 workshop (large-scale kernel machines)
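
A minimal NORMA-style sketch in Python (Gaussian kernel, hinge loss, scalar gain; all constants illustrative). Under SMD, the scalar η would become adaptive and v would be carried as a second coefficient expansion over the same support points, which is what allows ⟨g, v⟩ to be maintained incrementally in O(n):

```python
import numpy as np

def gauss_kernel(a, b, gamma=1.0):
    # Gaussian RBF kernel; any positive-definite kernel works here.
    return np.exp(-gamma * np.sum((a - b) ** 2))

class Norma:
    """NORMA-style online kernel classifier with a scalar gain eta."""
    def __init__(self, eta=0.1, lam=0.01):
        self.eta, self.lam = eta, lam
        self.X, self.alpha = [], []   # expansion: f(x) = sum_i alpha_i k(X_i, x)

    def predict(self, x):
        return sum(a * gauss_kernel(xi, x) for xi, a in zip(self.X, self.alpha))

    def step(self, x, y):
        f = self.predict(x)
        # Regularization shrinks all existing coefficients each step.
        self.alpha = [(1 - self.eta * self.lam) * a for a in self.alpha]
        # Hinge loss: add a new expansion term only on a margin violation.
        if y * f < 1:
            self.X.append(x)
            self.alpha.append(self.eta * y)
```

(In practice the expansion is also truncated to bound its size; that detail is omitted here.)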

SLIDE 26: Work in Progress

More applications of SMD:
  • policy gradient reinforcement learning (NIPS'05)
  • generalized Hebbian algorithm for kernel PCA
  • parameter estimation in conditional random fields

Also working on:
  • SMD convergence and stability analysis
  • further refinement of the algorithm
  • other ways to accelerate stochastic gradient

SLIDE 27: Summary and Outlook

Summary:
  • data-rich ML problems need stochastic approximation
  • classical gradient methods are not up to the task
  • SMD: excellent gain adaptation for stochastic gradient (the Hv product gives cheap second-order information)

Outlook:
  • increasing demand for stochastic gradient methods
  • SMD can greatly accelerate stochastic gradient