The imagination driving Australia’s ICT future.
Statistical Machine Learning Program www.nicta.com.au
Rapid Stochastic Gradient Descent: Accelerating Machine Learning
Nicol N. Schraudolph
Derivation and Algorithm
Properties and Benchmark Results
Applications and Ongoing Work
plentiful, affordable sensors (such as webcams)
ever-increasing networking of these sensors
science - pulsar survey at Arecibo: 1 TB/day
business - Dell website: over 100 page requests/sec
security - London: over 500’000 security cameras
large, complex, nonlinear models: millions of degrees of freedom
large volumes of low-quality data: noisy, correlated, non-stationary, outliers
efficient real-time, online adaptation: no fixed training set, life-long learning
[Diagram: an iterative optimizer fed either by a fixed training data set or by a training data stream (aka adaptive filtering, stochastic approximation, ...)]
inefficient for large data sets
inappropriate for never-ending, potentially non-stationary data streams
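$$\theta^* = \arg\min_\theta \mathrm{E}_x[J(\theta, x)], \qquad \mathrm{E}_x[J(\theta, x)] \approx \frac{1}{|X|} \sum_{x_i \in X} J(\theta, x_i)$$
$$\theta_{t+1} \approx \arg\min_\theta J(\theta_t, x_t) \qquad (t = 0, 1, 2, \dots)$$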
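The contrast between the two formulations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the quadratic per-sample loss, the names `loss`, `grad`, `batch_objective` and `online_pass`, and the fixed gain `eta` are all hypothetical and not taken from the slides.

```python
import numpy as np

def loss(theta, x):
    """Hypothetical per-sample loss J(theta, x): half the squared distance to x."""
    return 0.5 * np.sum((theta - x) ** 2)

def grad(theta, x):
    """Gradient of the hypothetical loss with respect to theta."""
    return theta - x

def batch_objective(theta, X):
    """Empirical approximation of E_x[J(theta, x)]: (1/|X|) * sum_i J(theta, x_i)."""
    return np.mean([loss(theta, x) for x in X])

def online_pass(theta, stream, eta=0.01):
    """Take one gradient step per incoming sample x_t; no training set is stored."""
    for x in stream:
        theta = theta - eta * grad(theta, x)
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(loc=[2.0, -1.0], scale=1.0, size=(5000, 2))
    theta_batch = X.mean(axis=0)                    # exact minimizer for this loss
    theta_online = online_pass(np.zeros(2), iter(X))
    print("batch objective at batch optimum:", batch_objective(theta_batch, X))
    print("batch objective at online estimate:", batch_objective(theta_online, X))
```

The batch objective needs the whole set `X` for every evaluation, whereas `online_pass` touches each sample once and could run on an unbounded stream.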
$O(1)$, $O(n)$, $O(n^2)$, $O(n^3)$
conjugate directions break down due to noise
line minimizations (CG, quasi-Newton) inaccurate
Newton, Levenberg-Marquardt, Kalman filter - too expensive for large-scale problems
evolutionary algorithms - very inefficient (don’t use gradient)
simple gradient descent - can be slow to converge
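$$\theta_{t+1} = \theta_t - \eta_t \cdot g_t, \qquad g_t := \partial_\theta J(\theta_t, x_t)$$
($\cdot$ is the Hadamard (element-wise) product)

Key idea:
$$\ln \eta_t = \ln \eta_{t-1} - \mu\, \partial_{\ln \eta} J(\theta_t, x_t)$$
($\mu$: free parameter)
$$\eta_t = \eta_{t-1} \cdot \exp\!\left(-\mu\, \partial_\theta J(\theta_t, x_t) \cdot \partial_{\ln \eta} \theta_t\right) \approx \eta_{t-1} \cdot \max\!\left(\tfrac{1}{2},\, 1 - \mu\, g_t \cdot v_t\right)$$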
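A small numerical sketch may help compare the exponential gain update with its linearized form. The function names and the example values of `g`, `v` and `mu` below are hypothetical, chosen only to show that the two forms agree for small arguments and that the floor of 1/2 keeps every gain positive when the linear term would go negative.

```python
import numpy as np

def gain_update_exact(eta_prev, g, v, mu):
    """eta_t = eta_{t-1} * exp(-mu * g_t * v_t), with element-wise products."""
    return eta_prev * np.exp(-mu * g * v)

def gain_update_linearized(eta_prev, g, v, mu):
    """eta_t ~= eta_{t-1} * max(1/2, 1 - mu * g_t * v_t), with element-wise products."""
    return eta_prev * np.maximum(0.5, 1.0 - mu * g * v)

if __name__ == "__main__":
    eta_prev = np.array([0.1, 0.1, 0.1])   # one gain per parameter
    g = np.array([1.0, -2.0, 50.0])        # illustrative gradient values
    v = np.array([-0.5, -0.5, 1.0])        # illustrative values of v_t
    mu = 0.1                               # illustrative meta-level step size
    print("exact:     ", gain_update_exact(eta_prev, g, v, mu))
    print("linearized:", gain_update_linearized(eta_prev, g, v, mu))
```

In the third component the linear factor would be 1 - 5 = -4, so the `max` clamps it to 1/2 and the gain is halved rather than flipped in sign.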
$$\eta_t = \eta_{t-1} \cdot \max\!\left(\tfrac{1}{2},\, 1 + \mu\, \eta_{t-1} \cdot g_{t-1} \cdot g_t\right)$$
$$v_{t+1} := \sum_{i=0}^{t} \lambda^i \frac{\partial \theta_{t+1}}{\partial \ln \eta_{t-i}}$$
define decay $0 \le \lambda \le 1$ (free parameter)
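$$
\begin{aligned}
v_{t+1} &:= \sum_{i=0}^{t} \lambda^i \frac{\partial \theta_{t+1}}{\partial \ln \eta_{t-i}} \\
&= \sum_{i=0}^{t} \lambda^i \frac{\partial \theta_t}{\partial \ln \eta_{t-i}} - \sum_{i=0}^{t} \lambda^i \frac{\partial (\eta_t \cdot g_t)}{\partial \ln \eta_{t-i}} \\
&= \lambda v_t - \sum_{i=0}^{t} \lambda^i \frac{\partial (\eta_t \cdot g_t)}{\partial \ln \eta_{t-i}} \\
&= \lambda v_t - \sum_{i=0}^{t} \lambda^i \frac{\partial \eta_t}{\partial \ln \eta_{t-i}} \cdot g_t - \sum_{i=0}^{t} \lambda^i\, \eta_t \cdot \frac{\partial g_t}{\partial \ln \eta_{t-i}} \\
&\approx \lambda v_t - \eta_t \cdot g_t - \eta_t \cdot \sum_{i=0}^{t} \lambda^i \frac{\partial g_t}{\partial \ln \eta_{t-i}} \\
&= \lambda v_t - \eta_t \cdot g_t - \eta_t \cdot \sum_{i=0}^{t} \lambda^i H_t \frac{\partial \theta_t}{\partial \ln \eta_{t-i}} \\
&= \lambda v_t - \eta_t \cdot (g_t + \lambda H_t v_t)
\end{aligned}
$$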
we obtain a simple iterative update for v
correct smoothing over correlated input signals
involves implicit Hessian-vector (Hv) product
can be computed as efficiently as 2-3 gradient evaluations
can be done automatically via algorithmic differentiation
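The three updates (gain, parameters, v) can be put together in a minimal sketch. Everything specific here is an assumption for illustration: the per-sample loss J = 0.5 (θ·x − y)², the function name `smd_least_squares`, and the default values of `eta0`, `mu` and `lam`. For this loss the Hessian is H = x xᵀ, so the product Hv is formed as x (x·v) without ever building the matrix; a general model would obtain Hv through algorithmic differentiation instead.

```python
import numpy as np

def smd_least_squares(X, y, eta0=0.05, mu=0.01, lam=0.99, seed=0):
    """Sketch of stochastic meta-descent on J = 0.5 * (theta @ x - y)**2."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    theta = np.zeros(n)
    eta = np.full(n, eta0)      # one gain per parameter
    v = np.zeros(n)             # v_0 = 0
    for t in rng.permutation(len(X)):
        x, target = X[t], y[t]
        g = (theta @ x - target) * x                     # stochastic gradient g_t
        eta = eta * np.maximum(0.5, 1.0 - mu * g * v)    # gain update
        Hv = x * (x @ v)                                 # implicit H_t v_t, H_t = x x^T
        theta = theta - eta * g                          # parameter update
        v = lam * v - eta * (g + lam * Hv)               # iterative update for v
    return theta, eta

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    theta_true = np.array([1.0, -2.0, 0.5])
    X = rng.normal(size=(2000, 3))
    y = X @ theta_true                                   # noise-free targets
    theta, eta = smd_least_squares(X, y)
    print("parameter error:", np.linalg.norm(theta - theta_true))
    print("final gains:", eta)
```

Note that `Hv` is computed from the current `v` before `v` is overwritten, and that every product other than `x @ v` and `theta @ x` is element-wise.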
v is too noisy to use directly; SMD achieves stability by means of the double integration v → η → θ
v⋅g is well-behaved (self-normalizing property)
SMD uses Gauss-Newton approximation of H
Compare simple stochastic gradient (SGD), conventional gain vector adaptation (ALAP), stochastic meta-descent (SMD), and a global extended Kalman filter (GEKF).
[Plot: loss vs. number of patterns (0k to 25k) for SMD, SGD, ALAP and GEKF]
[Plot: loss vs. seconds (up to 50) for SMD, SGD, ALAP and GEKF]
[Plots: E vs. patterns in three panels (i.i.d. uniform, Sobol, Brownian), comparing SMD, ELK1, vario-eta, momentum, ALAP and s-ALAP]
[Panels: deterministic (1000 pts); stochastic (1000 pts/iteration); stochastic (5 pts/iteration)]
(PhD thesis of M. Milano, Inst. of Computational Science, ETH Zürich)
linear PCA (160 p.c.)
neural network (160 nonlinear p.c., 75’000 d.o.f.)
15 neural networks, each about 180’000 parameters
the generic model has over 20 million parameters!
Matlab toolbox was able to train generic model
Learning Curves
[Plot: reconstruction error (log scale, 1e-02 to 1e+01) vs. iteration (x 10^3, 0.00 to 2.00) for bold driver and SMD]
(PhD thesis of M. Bray, Computer Vision Lab, ETH Zürich)
detailed hand model (10k vertices, 26 d.o.f.)
randomly sample a few points on model surface
project them to image
compare with camera image at these points
SMD uses resulting stochastic gradient to adjust model
roughly 40x speedup over state of the art (3 vs. 114 s/frame)
better tracking: noise helps escape local minima
robustness with respect to clutter, shadows, occlusions, ...
Ongoing work at NICTA:
use multiple, ordinary (even cheap) video cameras
simultaneous real-time tracking of hands, face & body
Online SVM aka NORMA (Kivinen, Smola, Williamson 2004):
stochastic gradient in expansion coefficients
employs scalar gain η
Application of SMD:
v is a function in the RKHS
<g,v> can be maintained incrementally in O(n)
NIPS’05 workshop (large-scale kernel machines)
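For orientation, the scalar-gain baseline can be sketched as follows. This is a simplified NORMA-style update under several assumptions not stated on the slide: a hinge loss with fixed margin, no offset term, an RBF kernel, and hypothetical names (`rbf`, `norma_fit`, `norma_predict`) and parameter values. It does not include the SMD extension, in which v would itself be kept as a function in the RKHS.

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """Gaussian RBF kernel k(a, b)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def norma_fit(X, y, eta=0.2, lam=0.01, margin=1.0, gamma=0.5):
    """Online kernel expansion f(x) = sum_i alpha_i k(x_i, x) with scalar gain eta."""
    centers, alphas = [], []
    for x_t, y_t in zip(X, y):
        f_t = sum(a * rbf(c, x_t, gamma) for c, a in zip(centers, alphas))
        # The regularizer shrinks all existing expansion coefficients.
        alphas = [(1.0 - eta * lam) * a for a in alphas]
        # The hinge loss contributes a new coefficient only on a margin violation.
        if y_t * f_t < margin:
            centers.append(x_t)
            alphas.append(eta * y_t)
    return centers, alphas

def norma_predict(centers, alphas, x, gamma=0.5):
    """Evaluate the current expansion at x."""
    return sum(a * rbf(c, x, gamma) for c, a in zip(centers, alphas))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.normal(loc=2.0, scale=0.5, size=(100, 2))
    neg = rng.normal(loc=-2.0, scale=0.5, size=(100, 2))
    X = np.vstack([pos, neg])
    y = np.array([1.0] * 100 + [-1.0] * 100)
    order = rng.permutation(len(X))
    centers, alphas = norma_fit(X[order], y[order])
    print("expansion size:", len(centers))
```

Each step is a stochastic gradient step on the expansion coefficients: old coefficients decay by the factor (1 − ηλ) and at most one new coefficient is added per sample.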
policy gradient reinforcement learning (NIPS’05)
generalized Hebbian algorithm for Kernel PCA
parameter estimation in conditional random fields
SMD convergence and stability analysis
further refinement of the algorithm
data-rich ML problems need stochastic approximation
classical gradient methods are not up to the task
SMD: excellent gain adaptation for stochastic gradient (Hv product gives cheap second-order information)
increasing demand for stochastic gradient methods
SMD can greatly accelerate stochastic gradient