Deep Learning and Physics -- 2019
Deep Random Neural Field
Shun-ichi Amari
RIKEN Center for Brain Science / Araya
Brief History of AI and NN
First Boom (1956~): AI and neural networks -- the perceptron
Dartmouth Conference; the Perceptron
Symbolic AI: universal computation, logic. Neural networks: learning machines.
Dark period (late 1960s~1970s); stochastic gradient descent learning for MLP (1967)
Perceptron
F. Rosenblatt, Principles of Neurodynamics, 1961
McCulloch-Pitts neurons: binary {0, 1}; learning; multilayer, lateral & feedback connections
[Diagram: input x → output z]
Deep Neural Networks
Rosenblatt: multilayer perceptron
$$ z = f(\boldsymbol{x}, W) $$
$$ L(\boldsymbol{x}, W) = \{y - f(\boldsymbol{x}, W)\}^2, \qquad f: \text{differentiable (analog neurons)} $$
$$ \boldsymbol{w} \to \boldsymbol{w} + \Delta\boldsymbol{w}, \qquad \Delta\boldsymbol{w} = -c\, \frac{\partial L}{\partial \boldsymbol{w}} $$
learning of hidden neurons (analog neurons)
Stochastic gradient learning: Amari, Tsypkin, 1966~67; error back-propagation, 1976
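A minimal numerical sketch of this update rule, assuming a single tanh ("analog") neuron and a teacher network; the settings here are illustrative, not the original 1967 construction:

```python
# Stochastic gradient descent dw = -c * dL/dw for one analog neuron,
# loss L = (y - f(x, w))^2 with f(x, w) = tanh(w . x).
import numpy as np

rng = np.random.default_rng(0)
w_true = rng.normal(size=3)      # hidden teacher weights (illustrative)
w = np.zeros(3)                  # learner weights
c = 0.1                          # learning constant

for t in range(2000):
    x = rng.normal(size=3)
    y = np.tanh(w_true @ x)      # teacher output
    f = np.tanh(w @ x)           # learner output
    dL_dw = -2 * (y - f) * (1 - f**2) * x   # gradient of (y - f)^2
    w -= c * dL_dw               # w <- w + dw,  dw = -c dL/dw

print("max weight error:", np.abs(w - w_true).max())
```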
Information Theory II -- Geometrical Theory of Information
Shun-ichi Amari, University of Tokyo
Kyoritsu Press, Tokyo, 1968
First stochastic gradient descent learning of MLP (1967; 1968)
$$ f(\boldsymbol{x}) = v_1 \max\{\boldsymbol{w}_1 \cdot \boldsymbol{x},\; \boldsymbol{w}_2 \cdot \boldsymbol{x}\} + v_2 \min\{\boldsymbol{w}_3 \cdot \boldsymbol{x},\; \boldsymbol{w}_4 \cdot \boldsymbol{x}\} - \theta $$
[Network diagram: input x; weights w_1, ..., w_4 into max and min units; output weights v_1, v_2; output y]
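A direct transcription of this 1967 network as code, assuming the reconstruction of the formula above; the weights and input are arbitrary illustrations:

```python
# f(x) = v1 * max(w1.x, w2.x) + v2 * min(w3.x, w4.x) - theta
import numpy as np

def f(x, W, v, theta):
    """W: 4 x d weight matrix; v = (v1, v2); theta: threshold."""
    h_max = max(W[0] @ x, W[1] @ x)   # max-unit
    h_min = min(W[2] @ x, W[3] @ x)   # min-unit
    return v[0] * h_max + v[1] * h_min - theta

rng = np.random.default_rng(1)
x = rng.normal(size=5)
W = rng.normal(size=(4, 5))
print(f(x, W, v=(1.0, -1.0), theta=0.0))
```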
Second Boom
AI (1970~): expert systems (MYCIN), stochastic inference (Bayes), chess (1997)
Neural networks (1980~): MLP (back-propagation), associative memory
Third Boom 2010~
Deep learning; stochastic inference (graphical models; Bayesian; WATSON)
Pattern recognition: vision, audition, sentence analysis, machine translation; AlphaGo
Language processing: sequences and dynamics (word2vec, deep learning with recurrent nets)
Integration of (symbol, logic) vs (pattern, dynamics)
Deep Learning
Self-organization + supervised learning
RBM (Restricted Boltzmann Machine), auto-encoder, recurrent nets
Dropout, contrastive divergence
Convolution, ResNet, ReLU, adversarial nets
Victory of Deep Neural Networks
Hinton 2005, 2006 ~ 2012, and many others
Visual and auditory patterns; the game of Go; sentence analysis and machine translation; adversarial networks and pattern generation
Mathematical Neuroscience searches for the principles
Mathematical studies using simple, idealized (not realistic) models; cf. computational neuroscience. AI: technological realization.
Mathematical Neuroscience and Brain
The brain has found and implemented the principles through evolution (random search), under historical and material restrictions.
It is very complex (not smartly designed).
Theoretical Problem of Learning: 1
Local solutions and global solutions
Simulated annealing; quantum annealing
[Figure: loss landscape $L(\Theta)$ over the parameter space $\Theta$]
Theoretical Problem of Learning: 2
Training loss and generalization loss: overtraining
$$ y = f(x, \theta) + \varepsilon $$
$$ L_{\mathrm{emp}} = \frac{1}{N} \sum_i |y_i - f(x_i, \theta)|^2, \qquad L_{\mathrm{gen}} = E\big[|y - f(x, \theta)|^2\big] $$
$$ L_{\mathrm{gen}} \approx L_{\mathrm{emp}} + \frac{P}{N} \qquad (P: \text{number of parameters}) $$
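A Monte Carlo sanity check of the gap $L_{\mathrm{gen}} - L_{\mathrm{emp}} = O(P/N)$, using ordinary least squares as an illustrative stand-in for $f(x, \theta)$; for OLS with unit noise the expected gap is about $2P/N$:

```python
import numpy as np

rng = np.random.default_rng(2)
P, N, N_test, trials = 20, 100, 10000, 200
gaps = []
for _ in range(trials):
    theta = rng.normal(size=P)
    X = rng.normal(size=(N, P));       y = X @ theta + rng.normal(size=N)
    Xt = rng.normal(size=(N_test, P)); yt = Xt @ theta + rng.normal(size=N_test)
    th_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    L_emp = np.mean((y - X @ th_hat) ** 2)     # training loss
    L_gen = np.mean((yt - Xt @ th_hat) ** 2)   # generalization loss
    gaps.append(L_gen - L_emp)
print("mean gap:", np.mean(gaps), " vs 2P/N =", 2 * P / N)
```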
Extremely wide network, $P \to \infty$ ($P \gg N$): local minimum = global minimum
Kawaguchi, 2019
Learning curve for $P \gg N$: double descent
Belkin et al., 2019; Hastie et al., 2019
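A minimal double-descent sketch, assuming ridgeless (minimum-norm) regression on random ReLU features; the teacher, noise level, and widths are illustrative choices. The test error peaks near the interpolation threshold $P = N$ and descends again for $P \gg N$:

```python
import numpy as np

rng = np.random.default_rng(3)
d, N, N_test = 10, 100, 2000
beta = rng.normal(size=d)
X = rng.normal(size=(N, d));       y = X @ beta + 0.5 * rng.normal(size=N)
Xt = rng.normal(size=(N_test, d)); yt = Xt @ beta

for P in [20, 50, 90, 100, 110, 200, 1000]:
    W = rng.normal(size=(d, P)) / np.sqrt(d)    # random feature weights
    F, Ft = np.maximum(X @ W, 0), np.maximum(Xt @ W, 0)
    a = np.linalg.pinv(F) @ y                   # minimum-norm (ridgeless) fit
    print(f"P = {P:5d}  test MSE = {np.mean((yt - Ft @ a) ** 2):.3f}")
```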
Random Neural Network
Random is excellent!! Random is magic!!
Statistical dynamics; random coding
Random Deep Networks
Poole et al., 2016; Schoenholz et al., 2017; ... -- signal propagation and error back-propagation
Jacot et al.: the neural tangent kernel
$$ y = f(x, \theta), \qquad l(x, y) = \tfrac{1}{2}\{y - f(x, \theta)\}^2 $$
With teacher parameters $\theta^*$ and error $e(x) = f(x, \theta) - f(x, \theta^*)$:
$$ l = \tfrac{1}{2}\{f(x, \theta) - f(x, \theta^*)\}^2, \qquad \dot\theta_t = -\eta\, \partial_\theta l = -\eta\, e(x)\, \partial_\theta f(x, \theta) $$
$$ \dot f(x', \theta_t) = \partial_\theta f(x') \cdot \dot\theta_t = -\eta\, \partial_\theta f(x') \cdot \partial_\theta f(x)\, e(x) $$
K: Gaussian kernel
$$ K(x, x'; \theta) = \partial_\theta f(x) \cdot \partial_\theta f(x'), \qquad \dot f(x, \theta_t) = -\eta\, \big\langle K(x, x'; \theta)\, e(x') \big\rangle $$
$$ K(x, x'; \theta_t) \approx K(x, x'; \theta_{\mathrm{ini}}): \ \text{Gaussian kernel at random initialization} $$
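A sketch of the empirical tangent kernel $K(x, x') = \partial_\theta f(x) \cdot \partial_\theta f(x')$ for a one-hidden-layer tanh network $f(x) = v \cdot \tanh(Wx)/\sqrt{n}$, with the gradients written out by hand; the width and inputs are illustrative. At large $n$ the kernel concentrates and barely changes when the random parameters are resampled:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 5000, 3
W = rng.normal(size=(n, d))
v = rng.normal(size=n)

def grad_f(x):
    """Gradient of f(x) = v . tanh(Wx) / sqrt(n) w.r.t. all parameters."""
    h = np.tanh(W @ x)
    gv = h / np.sqrt(n)                                # df/dv
    gW = (v * (1 - h**2))[:, None] * x / np.sqrt(n)    # df/dW
    return np.concatenate([gv, gW.ravel()])

x, xp = rng.normal(size=d), rng.normal(size=d)
print("empirical NTK K(x, x'):", grad_f(x) @ grad_f(xp))
```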
Theorem ($P \gg N$): the optimal solution lies near a random network.
Bailey et al., 2019
$$ w_{ij} = O(1/\sqrt{n}), \qquad \Delta w_{ij} = O(1/n) $$
(random initialization)
Random Neural Field
$$ u^l(z') = \int w(z', z)\, x^{l-1}(z)\, dz + b(z'), \qquad x^l(z') = \varphi\big(u^l(z')\big) $$
$w(z', z)$: random (zero-mean Gaussian, correlated)
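One layer of the field, discretized on a one-dimensional grid as a sketch; the squared-exponential correlation of $w$ over $z'$ and the $1/\sqrt{m}$ normalization of the discretized integral are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 200                                    # grid points on [0, 1]
z = np.linspace(0, 1, m)

# Zero-mean Gaussian weights, correlated over the output position z':
# Cov[w(z1', z), w(z2', z)] = exp(-(z1' - z2')^2 / (2 * 0.1^2)).
C = np.exp(-(z[:, None] - z[None, :])**2 / (2 * 0.1**2))
Lc = np.linalg.cholesky(C + 1e-6 * np.eye(m))
w = Lc @ rng.normal(size=(m, m))           # rows correlated in z'
b = Lc @ rng.normal(size=m)                # correlated bias field b(z')

x = np.sin(2 * np.pi * z)                  # input field x(z)
u = (w @ x) / np.sqrt(m) + b               # discretized integral term
x_out = np.tanh(u)                         # x^l(z') = phi(u(z'))
print(x_out[:5])
```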
Statistical Neurodynamics
microdynamics:
$$ \boldsymbol{x}_{t+1} = T_W(\boldsymbol{x}_t) = \mathrm{sgn}(W \boldsymbol{x}_t) $$
macrodynamics:
$$ X_{t+1} = F(X_t), \qquad X: \text{macrostate} $$
Does the macrodynamics close?
$$ X_2 = X(T_W \boldsymbol{x}_1), \qquad X_3 = X(T_W T_W \boldsymbol{x}_1) \overset{?}{=} F(F(X_1)) $$
Statistical Neurodynamics
Rozonoer (1969); Amari (1969, 1971, 1973)
Sompolinsky; Amari et al. (2013); Toyoizumi et al. (2015); Poole, ..., Ganguli (2016); Schoenholz et al. (2017); Yang & Schoenholz (2017); Karakida et al. (2019); Jacot et al. (2019); ...
$$ w_{ij} \sim N(0, 1) $$
Macroscopic behaviors common to almost all (typical) networks
Random Deep Networks
$$ x_i^{l+1} = \varphi\Big(\sum_j w_{ij}^l x_j^l + b_i^l\Big), \qquad A^l = \frac{1}{n}\sum_i (x_i^l)^2, \qquad A^{l+1} = F(A^l) $$
$$ w_{ij}^l \sim N(0, \sigma_w^2 / n), \qquad b_i \sim N(0, \sigma_b^2) $$
Macroscopic variables
activity: $A = \frac{1}{n}\sum_i x_i^2$, with $A^{l+1} = F(A^l)$
distance: $D = D[\boldsymbol{x} : \boldsymbol{x}']$, with $D^{l+1} = K(D^l)$
metric, curvature & Fisher information
Dynamics of Activity: law of large numbers
$$ x_i^{l+1} = \varphi\Big(\sum_k w_{ik} x_k + b_i\Big) = \varphi(u_i), \qquad u_i \sim N(0, A) $$
$$ A^{l+1} = \frac{1}{n} \sum_i \varphi(u_i)^2 \;\longrightarrow\; \chi(A) = \int \varphi(\sqrt{A}\,v)^2\, Dv, \qquad v \sim N(0, 1) $$
$$ \chi'(0) > 1: \qquad A = \frac{1}{n} \sum_i x_i^2 \ \text{converges to a fixed point} $$
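A simulation check of this activity law on a wide random tanh network, comparing one sampled network against the Gaussian mean-field integral $\chi(A)$ (here with the variances $\sigma_w, \sigma_b$ made explicit); width, depth, and variances are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
n, L, sw, sb = 2000, 8, 1.5, 0.1

def chi(A, k=10**6):
    """Mean-field map: E[ phi(u)^2 ] with u ~ N(0, sw^2 A + sb^2)."""
    v = rng.normal(size=k)
    return np.mean(np.tanh(np.sqrt(sw**2 * A + sb**2) * v) ** 2)

x = rng.normal(size=n)
A_mf = np.mean(x**2)
for l in range(L):
    W = rng.normal(size=(n, n)) * sw / np.sqrt(n)
    b = rng.normal(size=n) * sb
    x = np.tanh(W @ x + b)                   # one random layer
    A_mf = chi(A_mf)                         # mean-field prediction
    print(f"layer {l+1}: simulated A = {np.mean(x**2):.4f}, "
          f"mean-field = {A_mf:.4f}")
```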
Pullback Metric & Curvature
$$ ds_l^2 = \sum g_{ij}\, dx^i dx^j = \frac{1}{n}\, d\boldsymbol{x}^l \cdot d\boldsymbol{x}^l, \qquad \boldsymbol{x}^l = \varphi(W \boldsymbol{x}^{l-1}) $$
Basis vectors (Jacobian):
$$ dx_i^l = \sum_a B_{ia}^l\, dx_a^{l-1}, \qquad B_{ia}^l = \varphi'(u_i^l)\, W_{ia}^l $$
$$ d\boldsymbol{x}^l = B^l B^{l-1} \cdots B^1\, d\boldsymbol{x}^0, \qquad \boldsymbol{e}_a: \text{pullback basis vectors} $$
$$ g_{ab} = \frac{1}{n}\, \boldsymbol{e}_a \cdot \boldsymbol{e}_b $$
Dynamics of Metric
$$ dx_a^l = \sum_b B_{ab}\, dx_b^{l-1}, \qquad B_{ab} = \varphi'(u_a)\, w_{ab} $$
Mean-field approximation: $E[\varphi'(u_a)^2\, w_{ak} w_{aj}] \approx E[\varphi'(u_a)^2]\, E[w_{ak} w_{aj}]$
$$ \chi_1(A) = \sigma_w^2 \int \varphi'(\sqrt{A}\,v)^2\, Dv $$
Metric:
$$ g_{ab}^l \approx \chi_1(A)\, g_{ab}^{l-1}, \qquad ds_l^2 = \sum g_{ab}^l\, dx^a dx^b $$
Law of large numbers
$$ g_{ij}^l(x) = \Big(\prod_{l'} \chi_1(A^{l'})\Big)\, g_{ij}(x) $$
conformal geometry
Conformal transformation!
$$ g_{ij}^l(x) = \chi_1(A)\, g_{ij}^{l-1}(x), \qquad g_{ij}^1 = \chi_1\, \delta_{ij} \;\Longrightarrow\; g_{ij}^l = \prod \chi_1\, \delta_{ij} $$
(rotation and expansion)
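A numerical check of the conformal stretching, assuming the same illustrative tanh network: a random tangent vector pushed through the layer Jacobian $B = \mathrm{diag}(\varphi'(u))\, W$ should have its squared length multiplied by about $\chi_1(A)$ per layer:

```python
import numpy as np

rng = np.random.default_rng(7)
n, sw, sb = 2000, 1.5, 0.1
x = rng.normal(size=n)          # activity vector
e = rng.normal(size=n)          # tangent vector dx

for l in range(5):
    W = rng.normal(size=(n, n)) * sw / np.sqrt(n)
    b = rng.normal(size=n) * sb
    u = W @ x + b
    e_new = (1 - np.tanh(u)**2) * (W @ e)      # e <- B e, B = diag(phi'(u)) W
    ratio = np.mean(e_new**2) / np.mean(e**2)  # squared-length stretch
    A = np.mean(x**2)
    v = rng.normal(size=10**6)                 # Monte Carlo chi_1(A)
    chi1 = sw**2 * np.mean((1 - np.tanh(np.sqrt(sw**2 * A + sb**2) * v)**2)**2)
    print(f"layer {l+1}: stretch = {ratio:.3f}, chi1 = {chi1:.3f}")
    x, e = np.tanh(u), e_new
```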
Domino Theorem
$$ \frac{\partial \boldsymbol{x}^L}{\partial \boldsymbol{x}^l} = B^L B^{L-1} \cdots B^{l+1} $$
$$ E\big[(B^l)^\top B^l\big] = \chi_1\, \delta \;\Longrightarrow\; E\big[(B^L \cdots B^{l+1})^\top (B^L \cdots B^{l+1})\big] = \chi_1^{L-l}\, \delta $$
(each layer's Jacobian averages out in turn, like falling dominoes)
Dynamics of Curvature
$$ H_{ab} = \nabla_a \boldsymbol{e}_b, \qquad H_{ab}^{l+1} = \varphi''(u)\,(\boldsymbol{w}\cdot\boldsymbol{e}_a)(\boldsymbol{w}\cdot\boldsymbol{e}_b) + \varphi'(u)\,\boldsymbol{w}\cdot\partial_a \boldsymbol{e}_b = H_{ab}^{\perp} + H_{ab}^{\parallel} $$
$$ \chi_2(A) = \int \varphi''(\sqrt{A}\,v)^2\, Dv $$
$$ |H^{l+1}|^2 \approx \chi_1^2\, |H^l|^2 + \frac{1}{n}\, \chi_2(A) $$
$\chi_1 > 1$: exponential expansion! (the creation term is small)
Poole et al. (2016): deep neural networks
Distance
$$ D[\boldsymbol{x}, \boldsymbol{y}] = \frac{1}{n} \sum_i (x_i - y_i)^2 $$
Dynamics of Distance (Amari, 1974)
$$ D(x, x') = \frac{1}{n}\sum_i (x_i - x_i')^2, \qquad C(x, x') = \frac{1}{n}\sum_i x_i x_i', \qquad D = A + A' - 2C $$
$$ u_i = \sum_k w_{ik} x_k, \quad u_i' = \sum_k w_{ik} x_k', \qquad (u_i, u_i') \sim N(0, V), \quad V = \begin{pmatrix} A & C \\ C & A' \end{pmatrix} $$
$$ C^{l+1} = E\big[\varphi(\sqrt{C}\,\varepsilon + \sqrt{A - C}\,\nu)\; \varphi(\sqrt{C}\,\varepsilon + \sqrt{A' - C}\,\nu')\big] $$
$$ D^{l+1} = K(D^l), \qquad \frac{dD^{l+1}}{dD^l} = \chi_1 > 1 $$
Problem! Equi-distance property:
$$ D^l(x, x') \to D^* \quad (l \to \infty), \qquad D^* = K(D^*) $$
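The equi-distance property is easy to see in simulation: distances between input pairs at very different initial separations all flow toward the same fixed point $D^*$. Same illustrative tanh network as before:

```python
import numpy as np

rng = np.random.default_rng(8)
n, L, sw, sb = 2000, 20, 1.5, 0.1
x0 = rng.normal(size=n)
# Three companions of x0 at small, moderate, and large initial distance.
xs = [x0] + [x0 + eps * rng.normal(size=n) for eps in (0.01, 0.3, 3.0)]

for l in range(L):
    W = rng.normal(size=(n, n)) * sw / np.sqrt(n)
    b = rng.normal(size=n) * sb
    xs = [np.tanh(W @ x + b) for x in xs]
    if (l + 1) % 5 == 0:
        D = [np.mean((xs[0] - x)**2) for x in xs[1:]]
        print(f"layer {l+1}: D =", np.round(D, 4))  # all approach the same D*
```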
dynamics of distance: the wide and deep limits do not commute,
$$ \lim_{L \to \infty} \lim_{n \to \infty} D_{L,n}(x, y) \;\neq\; \lim_{n \to \infty} \lim_{L \to \infty} D_{L,n}(x, y) $$
Feedback Path
Error back-propagation and Fisher information
Stochastic model: the parameter space is a manifold of probability distributions.
$$ y = \varphi\big(W_L\, \varphi(W_{L-1} \cdots \varphi(W_1 x) \cdots)\big) + \varepsilon, \qquad \varepsilon \sim N(0, 1) $$
$$ p(y, x; W) = c \exp\Big\{ -\tfrac{1}{2} \big(y - \varphi(x; W)\big)^2 \Big\}\, q(x) $$
$$ G = E\big[\nabla \log p(y, x; W)\; \nabla \log p(y, x; W)^\top\big], \qquad ds^2 = dW^\top G\, dW $$
Learning: stochastic gradient descent. The steepest direction is the natural gradient:
$$ \nabla l = \Big(\frac{\partial l}{\partial \theta_1}, \ldots, \frac{\partial l}{\partial \theta_n}\Big), \qquad \tilde\nabla l = G^{-1}\, \nabla l $$
[Figure: increment $d\theta$ on the loss surface $l(\theta)$]
$$ \Delta\theta_t = -\eta_t\, \tilde\nabla l(x_t, y_t; \theta_t) $$
Natural Gradient
$$ dl = l(\theta + d\theta) - l(\theta) = \nabla l \cdot d\theta $$
$$ \max_{d\theta} \; dl \quad \text{subject to} \quad \mathrm{KL}\big[p(x, \theta) : p(x, \theta + d\theta)\big] = \varepsilon^2 \;\Longrightarrow\; d\theta \propto G^{-1} \nabla l = \tilde\nabla l $$
$$ \Delta\theta_t = -\eta_t\, \tilde\nabla l(x_t, y_t; \theta_t) $$
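A toy natural gradient descent where $G$ is known in closed form: fitting $(\mu, \log\sigma)$ of a Gaussian, whose Fisher matrix is $\mathrm{diag}(1/\sigma^2, 2)$, so $G^{-1}$ just rescales the two coordinates. Step size and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(9)
data = rng.normal(2.0, 3.0, size=10000)   # samples from N(2, 3^2)
mu, logsig = 0.0, 0.0
eta = 0.05

for t in range(5000):
    x = data[t % len(data)]
    sig2 = np.exp(2 * logsig)
    # Gradients of l = -log p(x; mu, sigma):
    g_mu = -(x - mu) / sig2
    g_ls = 1 - (x - mu)**2 / sig2
    # Natural gradient: multiply by G^{-1} = diag(sig^2, 1/2).
    mu -= eta * sig2 * g_mu
    logsig -= eta * g_ls / 2

print(f"mu = {mu:.2f}, sigma = {np.exp(logsig):.2f}")   # near (2, 3)
```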
Information Geometry of MLP
Natural gradient learning: S. Amari; H. Y. Park
$$ \Delta\theta_t = -\eta_t\, \hat G_t^{-1}\, \frac{\partial l}{\partial \theta} $$
Adaptive estimation of the inverse Fisher matrix:
$$ \hat G_{t+1}^{-1} = (1 + \varepsilon_t)\, \hat G_t^{-1} - \varepsilon_t\, \hat G_t^{-1}\, \nabla f\, (\nabla f)^\top\, \hat G_t^{-1} $$
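A sketch of this adaptive inverse-Fisher recursion on synthetic score vectors with a known second moment $G$, so the running estimate can be checked against the truth; dimensions and $\varepsilon$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(10)
d = 5
A = rng.normal(size=(d, d))
G_true = A @ A.T / d + np.eye(d)         # target Fisher matrix
Lc = np.linalg.cholesky(G_true)

G_inv = np.eye(d)                         # running estimate of G^{-1}
eps = 0.005
for t in range(20000):
    gf = Lc @ rng.normal(size=d)          # sample with E[gf gf^T] = G_true
    G_inv = (1 + eps) * G_inv - eps * G_inv @ np.outer(gf, gf) @ G_inv

print(np.round(G_inv @ G_true, 2))        # should be close to the identity
```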
Fisher Information
$$ G = E\big[\partial_W \log p \;\, \partial_W \log p^\top\big], \qquad \frac{\partial f}{\partial W^l} \ \text{involves the chained Jacobians} \ B^L B^{L-1} \cdots B^{l+1} $$
Block structure for random networks:
$$ G(W^l, W^m) = O(1/\sqrt{n}) \to 0 \quad (l \neq m), \qquad G(w_i^l, w_j^l) = O(1/\sqrt{n}) \to 0 \quad (i \neq j) $$
Unitwise natural gradient
$$ \Delta W = -\eta\, G_W^{-1}\, \nabla_W l $$
Y. Ollivier; Marceau-Caron
Good news and bad news
G*: unitwise-diagonal matrix
$$ G^{*-1} G \to I \quad (n \to \infty)\,? $$
Karakida theory
Eigenvalues of G:
$$ \bar\lambda = \frac{1}{P} \sum_i \lambda_i = O(1/n), \qquad \frac{1}{P} \sum_i \lambda_i^2 = O(1) \;\Longrightarrow\; \lambda_{\max} = O(n) $$
G: strongly distorted Riemannian metric
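An empirical look at this distorted spectrum for a random one-hidden-layer tanh net (width, input dimension, and sample count are illustrative): in this toy setup the mean eigenvalue of the empirical Fisher $G$ is tiny while the largest is of order one or more, so $\lambda_{\max}/\bar\lambda$ grows with the width:

```python
import numpy as np

rng = np.random.default_rng(11)
n, d, N = 300, 10, 2000
W = rng.normal(size=(n, d)) / np.sqrt(d)
v = rng.normal(size=n)

grads = []
for _ in range(N):                        # per-sample gradients of f
    x = rng.normal(size=d)
    h = np.tanh(W @ x)
    gv = h / np.sqrt(n)
    gW = (v * (1 - h**2))[:, None] * x / np.sqrt(n)
    grads.append(np.concatenate([gv, gW.ravel()]))
J = np.stack(grads)                       # N x P gradient matrix
P = J.shape[1]

mean_eig = np.sum(J**2) / (N * P)         # trace(G) / P
lam_max = np.linalg.eigvalsh(J @ J.T / N).max()  # same nonzero spectrum as G
print(f"mean eigenvalue = {mean_eig:.2e}, max eigenvalue = {lam_max:.2e}")
```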
References:
- Poole, ..., Ganguli (2016); Schoenholz et al. (2017); Yang & Schoenholz (2017); ...
- S. Amari, R. Karakida & M. Oizumi. Statistical neurodynamics of deep networks: Geometry of signal spaces. arXiv:1808.07169, 2018.
- S. Amari, R. Karakida & M. Oizumi. Fisher information and natural gradient learning of random deep networks. arXiv:1808.07172, 2018 (AISTATS 2019).
- R. Karakida, S. Akaho & S. Amari. Universal statistics of Fisher information in deep neural networks: Mean field approach. arXiv:1806.01316, 2018 (AISTATS 2019).