Deep Learning and Physics -- 2019
Deep Random Neural Field
Shun-ichi Amari
RIKEN Center for Brain Science / Araya
Brief History of AI and NN
First Boom (1956~): AI and neural networks -- the perceptron
Dartmouth Conference; the Perceptron
Symbolic AI: universal computation, logic. Neural networks: learning machines.
Dark period (late 1960s~1970s); stochastic gradient descent learning for MLP (1967)
Perceptron
F. Rosenblatt, Principles of Neurodynamics, 1961
McCulloch-Pitts neurons: binary {0, 1}; learning; multilayer, lateral & feedback connections
[Diagram: input x → output z]
Deep Neural Networks
Rosenblatt: multilayer perceptron
$$ z = f(\boldsymbol{x}, W) $$
$$ L(\boldsymbol{x}, W) = \{y - f(\boldsymbol{x}, W)\}^2, \qquad f: \text{differentiable (analog neurons)} $$
$$ \boldsymbol{w} \to \boldsymbol{w} + \Delta\boldsymbol{w}, \qquad \Delta\boldsymbol{w} = -c\, \frac{\partial L}{\partial \boldsymbol{w}} $$
learning of hidden neurons (analog neurons)
Stochastic gradient learning: Amari, Tsypkin, 1966~67; error back-propagation, 1976
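A minimal numerical sketch of this update rule, assuming a single tanh ("analog") neuron and a teacher network; the settings here are illustrative, not the original 1967 construction:

```python
# Stochastic gradient descent dw = -c * dL/dw for one analog neuron,
# loss L = (y - f(x, w))^2 with f(x, w) = tanh(w . x).
import numpy as np

rng = np.random.default_rng(0)
w_true = rng.normal(size=3)      # hidden teacher weights (illustrative)
w = np.zeros(3)                  # learner weights
c = 0.1                          # learning constant

for t in range(2000):
    x = rng.normal(size=3)
    y = np.tanh(w_true @ x)      # teacher output
    f = np.tanh(w @ x)           # learner output
    dL_dw = -2 * (y - f) * (1 - f**2) * x   # gradient of (y - f)^2
    w -= c * dL_dw               # w <- w + dw,  dw = -c dL/dw

print("max weight error:", np.abs(w - w_true).max())
```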
Information Theory II -- Geometrical Theory of Information
Shun-ichi Amari, University of Tokyo
Kyoritsu Press, Tokyo, 1968
First stochastic gradient descent learning of MLP (1967; 1968)
$$ f(\boldsymbol{x}) = v_1 \max\{\boldsymbol{w}_1 \cdot \boldsymbol{x},\; \boldsymbol{w}_2 \cdot \boldsymbol{x}\} + v_2 \min\{\boldsymbol{w}_3 \cdot \boldsymbol{x},\; \boldsymbol{w}_4 \cdot \boldsymbol{x}\} - \theta $$
[Network diagram: input x; weights w_1, ..., w_4 into max and min units; output weights v_1, v_2; output y]
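A direct transcription of this 1967 network as code, assuming the reconstruction of the formula above; the weights and input are arbitrary illustrations:

```python
# f(x) = v1 * max(w1.x, w2.x) + v2 * min(w3.x, w4.x) - theta
import numpy as np

def f(x, W, v, theta):
    """W: 4 x d weight matrix; v = (v1, v2); theta: threshold."""
    h_max = max(W[0] @ x, W[1] @ x)   # max-unit
    h_min = min(W[2] @ x, W[3] @ x)   # min-unit
    return v[0] * h_max + v[1] * h_min - theta

rng = np.random.default_rng(1)
x = rng.normal(size=5)
W = rng.normal(size=(4, 5))
print(f(x, W, v=(1.0, -1.0), theta=0.0))
```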
Second Boom
AI (1970~): expert systems (MYCIN), stochastic inference (Bayes), chess (1997)
Neural networks (1980~): MLP (back-propagation), associative memory
Third Boom 2010~
Deep learning; stochastic inference (graphical models; Bayesian; WATSON)
Pattern recognition: vision, audition, sentence analysis, machine translation; AlphaGo
Language processing: sequences and dynamics (word2vec, deep learning with recurrent nets)
Integration of (symbol, logic) vs (pattern, dynamics)
Deep Learning
Self-organization + supervised learning
RBM (Restricted Boltzmann Machine), auto-encoder, recurrent nets
Dropout, contrastive divergence
Convolution, ResNet, ReLU, adversarial nets
Victory of Deep Neural Networks
Hinton 2005, 2006 ~ 2012, and many others
Visual and auditory patterns; the game of Go; sentence analysis and machine translation; adversarial networks and pattern generation
Mathematical Neuroscience searches for the principles
Mathematical studies using simple, idealized (not realistic) models; cf. computational neuroscience. AI: technological realization.
Mathematical Neuroscience and Brain
The brain has found and implemented the principles through evolution (random search), under historical and material restrictions.
It is very complex (not smartly designed).
Theoretical Problem of Learning: 1
Local solutions and global solutions
Simulated annealing; quantum annealing
[Figure: loss landscape $L(\Theta)$ over the parameter space $\Theta$]
Theoretical Problem of Learning: 2
Training loss and generalization loss: overtraining
$$ y = f(x, \theta) + \varepsilon $$
$$ L_{\mathrm{emp}} = \frac{1}{N} \sum_i |y_i - f(x_i, \theta)|^2, \qquad L_{\mathrm{gen}} = E\big[|y - f(x, \theta)|^2\big] $$
$$ L_{\mathrm{gen}} \approx L_{\mathrm{emp}} + \frac{P}{N} \qquad (P: \text{number of parameters}) $$
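A Monte Carlo sanity check of the gap $L_{\mathrm{gen}} - L_{\mathrm{emp}} = O(P/N)$, using ordinary least squares as an illustrative stand-in for $f(x, \theta)$; for OLS with unit noise the expected gap is about $2P/N$:

```python
import numpy as np

rng = np.random.default_rng(2)
P, N, N_test, trials = 20, 100, 10000, 200
gaps = []
for _ in range(trials):
    theta = rng.normal(size=P)
    X = rng.normal(size=(N, P));       y = X @ theta + rng.normal(size=N)
    Xt = rng.normal(size=(N_test, P)); yt = Xt @ theta + rng.normal(size=N_test)
    th_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    L_emp = np.mean((y - X @ th_hat) ** 2)     # training loss
    L_gen = np.mean((yt - Xt @ th_hat) ** 2)   # generalization loss
    gaps.append(L_gen - L_emp)
print("mean gap:", np.mean(gaps), " vs 2P/N =", 2 * P / N)
```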
Extremely wide network, $P \to \infty$ ($P \gg N$): local minimum = global minimum
Kawaguchi, 2019
Learning curve for $P \gg N$: double descent
Belkin et al., 2019; Hastie et al., 2019
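A minimal double-descent sketch, assuming ridgeless (minimum-norm) regression on random ReLU features; the teacher, noise level, and widths are illustrative choices. The test error peaks near the interpolation threshold $P = N$ and descends again for $P \gg N$:

```python
import numpy as np

rng = np.random.default_rng(3)
d, N, N_test = 10, 100, 2000
beta = rng.normal(size=d)
X = rng.normal(size=(N, d));       y = X @ beta + 0.5 * rng.normal(size=N)
Xt = rng.normal(size=(N_test, d)); yt = Xt @ beta

for P in [20, 50, 90, 100, 110, 200, 1000]:
    W = rng.normal(size=(d, P)) / np.sqrt(d)    # random feature weights
    F, Ft = np.maximum(X @ W, 0), np.maximum(Xt @ W, 0)
    a = np.linalg.pinv(F) @ y                   # minimum-norm (ridgeless) fit
    print(f"P = {P:5d}  test MSE = {np.mean((yt - Ft @ a) ** 2):.3f}")
```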
Random Neural Network
Random is excellent!! Random is magic!!
Statistical dynamics; random coding
Random Deep Networks
Poole et al., 2016; Schoenholz et al., 2017; ... -- signal propagation and error back-propagation
Jacot et al.: the neural tangent kernel
$$ y = f(x, \theta), \qquad l(x, y) = \tfrac{1}{2}\{y - f(x, \theta)\}^2 $$
With teacher parameters $\theta^*$ and error $e(x) = f(x, \theta) - f(x, \theta^*)$:
$$ l = \tfrac{1}{2}\{f(x, \theta) - f(x, \theta^*)\}^2, \qquad \dot\theta_t = -\eta\, \partial_\theta l = -\eta\, e(x)\, \partial_\theta f(x, \theta) $$
$$ \dot f(x', \theta_t) = \partial_\theta f(x') \cdot \dot\theta_t = -\eta\, \partial_\theta f(x') \cdot \partial_\theta f(x)\, e(x) $$
K: Gaussian kernel
$$ K(x, x'; \theta) = \partial_\theta f(x) \cdot \partial_\theta f(x'), \qquad \dot f(x, \theta_t) = -\eta\, \big\langle K(x, x'; \theta)\, e(x') \big\rangle $$
$$ K(x, x'; \theta_t) \approx K(x, x'; \theta_{\mathrm{ini}}): \ \text{Gaussian kernel at random initialization} $$
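A sketch of the empirical tangent kernel $K(x, x') = \partial_\theta f(x) \cdot \partial_\theta f(x')$ for a one-hidden-layer tanh network $f(x) = v \cdot \tanh(Wx)/\sqrt{n}$, with the gradients written out by hand; the width and inputs are illustrative. At large $n$ the kernel concentrates and barely changes when the random parameters are resampled:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 5000, 3
W = rng.normal(size=(n, d))
v = rng.normal(size=n)

def grad_f(x):
    """Gradient of f(x) = v . tanh(Wx) / sqrt(n) w.r.t. all parameters."""
    h = np.tanh(W @ x)
    gv = h / np.sqrt(n)                                # df/dv
    gW = (v * (1 - h**2))[:, None] * x / np.sqrt(n)    # df/dW
    return np.concatenate([gv, gW.ravel()])

x, xp = rng.normal(size=d), rng.normal(size=d)
print("empirical NTK K(x, x'):", grad_f(x) @ grad_f(xp))
```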
Theorem ($P \gg N$): the optimal solution lies near a random network.
Bailey et al., 2019
$$ w_{ij} = O(1/\sqrt{n}), \qquad \Delta w_{ij} = O(1/n) $$
(random initialization)
Random Neural Field
$$ u^l(z') = \int w(z', z)\, x^{l-1}(z)\, dz + b(z'), \qquad x^l(z') = \varphi\big(u^l(z')\big) $$
$w(z', z)$: random (zero-mean Gaussian, correlated)
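One layer of the field, discretized on a one-dimensional grid as a sketch; the squared-exponential correlation of $w$ over $z'$ and the $1/\sqrt{m}$ normalization of the discretized integral are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 200                                    # grid points on [0, 1]
z = np.linspace(0, 1, m)

# Zero-mean Gaussian weights, correlated over the output position z':
# Cov[w(z1', z), w(z2', z)] = exp(-(z1' - z2')^2 / (2 * 0.1^2)).
C = np.exp(-(z[:, None] - z[None, :])**2 / (2 * 0.1**2))
Lc = np.linalg.cholesky(C + 1e-6 * np.eye(m))
w = Lc @ rng.normal(size=(m, m))           # rows correlated in z'
b = Lc @ rng.normal(size=m)                # correlated bias field b(z')

x = np.sin(2 * np.pi * z)                  # input field x(z)
u = (w @ x) / np.sqrt(m) + b               # discretized integral term
x_out = np.tanh(u)                         # x^l(z') = phi(u(z'))
print(x_out[:5])
```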
Statistical Neurodynamics
microdynamics:
$$ \boldsymbol{x}_{t+1} = T_W(\boldsymbol{x}_t) = \mathrm{sgn}(W \boldsymbol{x}_t) $$
macrodynamics:
$$ X_{t+1} = F(X_t), \qquad X: \text{macrostate} $$
Does the macrodynamics close?
$$ X_2 = X(T_W \boldsymbol{x}_1), \qquad X_3 = X(T_W T_W \boldsymbol{x}_1) \overset{?}{=} F(F(X_1)) $$
Statistical Neurodynamics
Rozonoer (1969); Amari (1969, 1971, 1973)
Sompolinsky; Amari et al. (2013); Toyoizumi et al. (2015); Poole, ..., Ganguli (2016); Schoenholz et al. (2017); Yang & Schoenholz (2017); Karakida et al. (2019); Jacot et al. (2019); ...
$$ w_{ij} \sim N(0, 1) $$
Macroscopic behaviors common to almost all (typical) networks
Random Deep Networks
$$ x_i^{l+1} = \varphi\Big(\sum_j w_{ij}^l x_j^l + b_i^l\Big), \qquad A^l = \frac{1}{n}\sum_i (x_i^l)^2, \qquad A^{l+1} = F(A^l) $$
$$ w_{ij}^l \sim N(0, \sigma_w^2 / n), \qquad b_i \sim N(0, \sigma_b^2) $$
Macroscopic variables
activity: $A = \frac{1}{n}\sum_i x_i^2$, with $A^{l+1} = F(A^l)$
distance: $D = D[\boldsymbol{x} : \boldsymbol{x}']$, with $D^{l+1} = K(D^l)$
metric, curvature & Fisher information
Dynamics of Activity: law of large numbers
$$ x_i^{l+1} = \varphi\Big(\sum_k w_{ik} x_k + b_i\Big) = \varphi(u_i), \qquad u_i \sim N(0, A) $$
$$ A^{l+1} = \frac{1}{n} \sum_i \varphi(u_i)^2 \;\longrightarrow\; \chi(A) = \int \varphi(\sqrt{A}\,v)^2\, Dv, \qquad v \sim N(0, 1) $$
$$ \chi'(0) > 1: \qquad A = \frac{1}{n} \sum_i x_i^2 \ \text{converges to a fixed point} $$
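A simulation check of this activity law on a wide random tanh network, comparing one sampled network against the Gaussian mean-field integral $\chi(A)$ (here with the variances $\sigma_w, \sigma_b$ made explicit); width, depth, and variances are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
n, L, sw, sb = 2000, 8, 1.5, 0.1

def chi(A, k=10**6):
    """Mean-field map: E[ phi(u)^2 ] with u ~ N(0, sw^2 A + sb^2)."""
    v = rng.normal(size=k)
    return np.mean(np.tanh(np.sqrt(sw**2 * A + sb**2) * v) ** 2)

x = rng.normal(size=n)
A_mf = np.mean(x**2)
for l in range(L):
    W = rng.normal(size=(n, n)) * sw / np.sqrt(n)
    b = rng.normal(size=n) * sb
    x = np.tanh(W @ x + b)                   # one random layer
    A_mf = chi(A_mf)                         # mean-field prediction
    print(f"layer {l+1}: simulated A = {np.mean(x**2):.4f}, "
          f"mean-field = {A_mf:.4f}")
```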
Pullback Metric & Curvature
$$ ds_l^2 = \sum g_{ij}\, dx^i dx^j = \frac{1}{n}\, d\boldsymbol{x}^l \cdot d\boldsymbol{x}^l, \qquad \boldsymbol{x}^l = \varphi(W \boldsymbol{x}^{l-1}) $$
Basis vectors (Jacobian):
$$ dx_i^l = \sum_a B_{ia}^l\, dx_a^{l-1}, \qquad B_{ia}^l = \varphi'(u_i^l)\, W_{ia}^l $$
$$ d\boldsymbol{x}^l = B^l B^{l-1} \cdots B^1\, d\boldsymbol{x}^0, \qquad \boldsymbol{e}_a: \text{pullback basis vectors} $$
$$ g_{ab} = \frac{1}{n}\, \boldsymbol{e}_a \cdot \boldsymbol{e}_b $$
Dynamics of Metric
$$ dx_a^l = \sum_b B_{ab}\, dx_b^{l-1}, \qquad B_{ab} = \varphi'(u_a)\, w_{ab} $$
Mean-field approximation: $E[\varphi'(u_a)^2\, w_{ak} w_{aj}] \approx E[\varphi'(u_a)^2]\, E[w_{ak} w_{aj}]$
$$ \chi_1(A) = \sigma_w^2 \int \varphi'(\sqrt{A}\,v)^2\, Dv $$
Metric:
$$ g_{ab}^l \approx \chi_1(A)\, g_{ab}^{l-1}, \qquad ds_l^2 = \sum g_{ab}^l\, dx^a dx^b $$
Law of large numbers
$$ g_{ij}^l(x) = \Big(\prod_{l'} \chi_1(A^{l'})\Big)\, g_{ij}(x) $$
conformal geometry
Conformal transformation!
$$ g_{ij}^l(x) = \chi_1(A)\, g_{ij}^{l-1}(x), \qquad g_{ij}^1 = \chi_1\, \delta_{ij} \;\Longrightarrow\; g_{ij}^l = \prod \chi_1\, \delta_{ij} $$
(rotation and expansion)
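A numerical check of the conformal stretching, assuming the same illustrative tanh network: a random tangent vector pushed through the layer Jacobian $B = \mathrm{diag}(\varphi'(u))\, W$ should have its squared length multiplied by about $\chi_1(A)$ per layer:

```python
import numpy as np

rng = np.random.default_rng(7)
n, sw, sb = 2000, 1.5, 0.1
x = rng.normal(size=n)          # activity vector
e = rng.normal(size=n)          # tangent vector dx

for l in range(5):
    W = rng.normal(size=(n, n)) * sw / np.sqrt(n)
    b = rng.normal(size=n) * sb
    u = W @ x + b
    e_new = (1 - np.tanh(u)**2) * (W @ e)      # e <- B e, B = diag(phi'(u)) W
    ratio = np.mean(e_new**2) / np.mean(e**2)  # squared-length stretch
    A = np.mean(x**2)
    v = rng.normal(size=10**6)                 # Monte Carlo chi_1(A)
    chi1 = sw**2 * np.mean((1 - np.tanh(np.sqrt(sw**2 * A + sb**2) * v)**2)**2)
    print(f"layer {l+1}: stretch = {ratio:.3f}, chi1 = {chi1:.3f}")
    x, e = np.tanh(u), e_new
```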
Domino Theorem
$$ \frac{\partial \boldsymbol{x}^L}{\partial \boldsymbol{x}^l} = B^L B^{L-1} \cdots B^{l+1} $$
$$ E\big[(B^l)^\top B^l\big] = \chi_1\, \delta \;\Longrightarrow\; E\big[(B^L \cdots B^{l+1})^\top (B^L \cdots B^{l+1})\big] = \chi_1^{L-l}\, \delta $$
(each layer's Jacobian averages out in turn, like falling dominoes)
Dynamics of Curvature
$$ H_{ab} = \nabla_a \boldsymbol{e}_b, \qquad H_{ab}^{l+1} = \varphi''(u)\,(\boldsymbol{w}\cdot\boldsymbol{e}_a)(\boldsymbol{w}\cdot\boldsymbol{e}_b) + \varphi'(u)\,\boldsymbol{w}\cdot\partial_a \boldsymbol{e}_b = H_{ab}^{\perp} + H_{ab}^{\parallel} $$
$$ \chi_2(A) = \int \varphi''(\sqrt{A}\,v)^2\, Dv $$
$$ |H^{l+1}|^2 \approx \chi_1^2\, |H^l|^2 + \frac{1}{n}\, \chi_2(A) $$
$\chi_1 > 1$: exponential expansion! (the creation term is small)
Poole et al. (2016): deep neural networks
Distance
$$ D[\boldsymbol{x}, \boldsymbol{y}] = \frac{1}{n} \sum_i (x_i - y_i)^2 $$
Dynamics of Distance (Amari, 1974)
$$ D(x, x') = \frac{1}{n}\sum_i (x_i - x_i')^2, \qquad C(x, x') = \frac{1}{n}\sum_i x_i x_i', \qquad D = A + A' - 2C $$
$$ u_i = \sum_k w_{ik} x_k, \quad u_i' = \sum_k w_{ik} x_k', \qquad (u_i, u_i') \sim N(0, V), \quad V = \begin{pmatrix} A & C \\ C & A' \end{pmatrix} $$
$$ C^{l+1} = E\big[\varphi(\sqrt{C}\,\varepsilon + \sqrt{A - C}\,\nu)\; \varphi(\sqrt{C}\,\varepsilon + \sqrt{A' - C}\,\nu')\big] $$
$$ D^{l+1} = K(D^l), \qquad \frac{dD^{l+1}}{dD^l} = \chi_1 > 1 $$
Problem! Equi-distance property:
$$ D^l(x, x') \to D^* \quad (l \to \infty), \qquad D^* = K(D^*) $$
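The equi-distance property is easy to see in simulation: distances between input pairs at very different initial separations all flow toward the same fixed point $D^*$. Same illustrative tanh network as before:

```python
import numpy as np

rng = np.random.default_rng(8)
n, L, sw, sb = 2000, 20, 1.5, 0.1
x0 = rng.normal(size=n)
# Three companions of x0 at small, moderate, and large initial distance.
xs = [x0] + [x0 + eps * rng.normal(size=n) for eps in (0.01, 0.3, 3.0)]

for l in range(L):
    W = rng.normal(size=(n, n)) * sw / np.sqrt(n)
    b = rng.normal(size=n) * sb
    xs = [np.tanh(W @ x + b) for x in xs]
    if (l + 1) % 5 == 0:
        D = [np.mean((xs[0] - x)**2) for x in xs[1:]]
        print(f"layer {l+1}: D =", np.round(D, 4))  # all approach the same D*
```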
dynamics of distance: the wide and deep limits do not commute,
$$ \lim_{L \to \infty} \lim_{n \to \infty} D_{L,n}(x, y) \;\neq\; \lim_{n \to \infty} \lim_{L \to \infty} D_{L,n}(x, y) $$
Feedback Path
Error back-propagation and Fisher information
Stochastic model: the parameter space is a manifold of probability distributions.
$$ y = \varphi\big(W_L\, \varphi(W_{L-1} \cdots \varphi(W_1 x) \cdots)\big) + \varepsilon, \qquad \varepsilon \sim N(0, 1) $$
$$ p(y, x; W) = c \exp\Big\{ -\tfrac{1}{2} \big(y - \varphi(x; W)\big)^2 \Big\}\, q(x) $$
$$ G = E\big[\nabla \log p(y, x; W)\; \nabla \log p(y, x; W)^\top\big], \qquad ds^2 = dW^\top G\, dW $$
Learning: stochastic gradient descent. The steepest direction is the natural gradient:
$$ \nabla l = \Big(\frac{\partial l}{\partial \theta_1}, \ldots, \frac{\partial l}{\partial \theta_n}\Big), \qquad \tilde\nabla l = G^{-1}\, \nabla l $$
[Figure: increment $d\theta$ on the loss surface $l(\theta)$]
$$ \Delta\theta_t = -\eta_t\, \tilde\nabla l(x_t, y_t; \theta_t) $$
Natural Gradient
$$ dl = l(\theta + d\theta) - l(\theta) = \nabla l \cdot d\theta $$
$$ \max_{d\theta} \; dl \quad \text{subject to} \quad \mathrm{KL}\big[p(x, \theta) : p(x, \theta + d\theta)\big] = \varepsilon^2 \;\Longrightarrow\; d\theta \propto G^{-1} \nabla l = \tilde\nabla l $$
$$ \Delta\theta_t = -\eta_t\, \tilde\nabla l(x_t, y_t; \theta_t) $$
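A toy natural gradient descent where $G$ is known in closed form: fitting $(\mu, \log\sigma)$ of a Gaussian, whose Fisher matrix is $\mathrm{diag}(1/\sigma^2, 2)$, so $G^{-1}$ just rescales the two coordinates. Step size and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(9)
data = rng.normal(2.0, 3.0, size=10000)   # samples from N(2, 3^2)
mu, logsig = 0.0, 0.0
eta = 0.05

for t in range(5000):
    x = data[t % len(data)]
    sig2 = np.exp(2 * logsig)
    # Gradients of l = -log p(x; mu, sigma):
    g_mu = -(x - mu) / sig2
    g_ls = 1 - (x - mu)**2 / sig2
    # Natural gradient: multiply by G^{-1} = diag(sig^2, 1/2).
    mu -= eta * sig2 * g_mu
    logsig -= eta * g_ls / 2

print(f"mu = {mu:.2f}, sigma = {np.exp(logsig):.2f}")   # near (2, 3)
```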
Information Geometry of MLP
Natural gradient learning: S. Amari; H. Y. Park
$$ \Delta\theta_t = -\eta_t\, \hat G_t^{-1}\, \frac{\partial l}{\partial \theta} $$
Adaptive estimation of the inverse Fisher matrix:
$$ \hat G_{t+1}^{-1} = (1 + \varepsilon_t)\, \hat G_t^{-1} - \varepsilon_t\, \hat G_t^{-1}\, \nabla f\, (\nabla f)^\top\, \hat G_t^{-1} $$
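A sketch of this adaptive inverse-Fisher recursion on synthetic score vectors with a known second moment $G$, so the running estimate can be checked against the truth; dimensions and $\varepsilon$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(10)
d = 5
A = rng.normal(size=(d, d))
G_true = A @ A.T / d + np.eye(d)         # target Fisher matrix
Lc = np.linalg.cholesky(G_true)

G_inv = np.eye(d)                         # running estimate of G^{-1}
eps = 0.005
for t in range(20000):
    gf = Lc @ rng.normal(size=d)          # sample with E[gf gf^T] = G_true
    G_inv = (1 + eps) * G_inv - eps * G_inv @ np.outer(gf, gf) @ G_inv

print(np.round(G_inv @ G_true, 2))        # should be close to the identity
```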
Fisher Information
$$ G = E\big[\partial_W \log p \;\, \partial_W \log p^\top\big], \qquad \frac{\partial f}{\partial W^l} \ \text{involves the chained Jacobians} \ B^L B^{L-1} \cdots B^{l+1} $$
Block structure for random networks:
$$ G(W^l, W^m) = O(1/\sqrt{n}) \to 0 \quad (l \neq m), \qquad G(w_i^l, w_j^l) = O(1/\sqrt{n}) \to 0 \quad (i \neq j) $$
Unitwise natural gradient
$$ \Delta W = -\eta\, G_W^{-1}\, \nabla_W l $$
Y. Ollivier; Marceau-Caron
Good news and bad news
G*: unitwise-diagonal matrix
$$ G^{*-1} G \to I \quad (n \to \infty)\,? $$
Karakida theory
Eigenvalues of G:
$$ \bar\lambda = \frac{1}{P} \sum_i \lambda_i = O(1/n), \qquad \frac{1}{P} \sum_i \lambda_i^2 = O(1) \;\Longrightarrow\; \lambda_{\max} = O(n) $$
G: strongly distorted Riemannian metric
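An empirical look at this distorted spectrum for a random one-hidden-layer tanh net (width, input dimension, and sample count are illustrative): in this toy setup the mean eigenvalue of the empirical Fisher $G$ is tiny while the largest is of order one or more, so $\lambda_{\max}/\bar\lambda$ grows with the width:

```python
import numpy as np

rng = np.random.default_rng(11)
n, d, N = 300, 10, 2000
W = rng.normal(size=(n, d)) / np.sqrt(d)
v = rng.normal(size=n)

grads = []
for _ in range(N):                        # per-sample gradients of f
    x = rng.normal(size=d)
    h = np.tanh(W @ x)
    gv = h / np.sqrt(n)
    gW = (v * (1 - h**2))[:, None] * x / np.sqrt(n)
    grads.append(np.concatenate([gv, gW.ravel()]))
J = np.stack(grads)                       # N x P gradient matrix
P = J.shape[1]

mean_eig = np.sum(J**2) / (N * P)         # trace(G) / P
lam_max = np.linalg.eigvalsh(J @ J.T / N).max()  # same nonzero spectrum as G
print(f"mean eigenvalue = {mean_eig:.2e}, max eigenvalue = {lam_max:.2e}")
```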
References:
- Poole, ..., Ganguli (2016); Schoenholz et al. (2017); Yang & Schoenholz (2017); ...
- S. Amari, R. Karakida & M. Oizumi. Statistical neurodynamics of deep networks: Geometry of signal spaces. arXiv:1808.07169, 2018.
- S. Amari, R. Karakida & M. Oizumi. Fisher information and natural gradient learning of random deep networks. arXiv:1808.07172, 2018 (AISTATS 2019).
- R. Karakida, S. Akaho & S. Amari. Universal statistics of Fisher information in deep neural networks: Mean field approach. arXiv:1806.01316, 2018 (AISTATS 2019).