

SLIDE 1

Deep Learning and Physics -- 2019

Deep Random Neural Field

Shun-ichi Amari

RIKEN Center for Brain Science; Araya

SLIDE 2

Brief History of AI and NN

First Boom (1956~): AI and neural networks (perceptron)

Dartmouth Conference; Perceptron

Symbolic AI: universal computation, logic. Neural networks: learning machines.

Dark period (late 1960s~1970s): stochastic gradient descent learning (1967) for MLP

SLIDE 3

Perceptron

F. Rosenblatt, Principles of Neurodynamics, 1961

McCulloch-Pitts neurons (0/1 binary), learning, multilayer, lateral & feedback connections

(figure: network mapping input x to output z)

SLIDE 4

Deep Neural Networks

Rosenblatt: multilayer perceptron

z = f(x, W)

L(W) = |y − f(x, W)|²  (differentiable: analog neuron)

w → w + Δw,  Δw = −c ∂L/∂w

learning of hidden neurons; analog neurons

stochastic gradient learning: Amari, Tsypkin, 1966~67; error back-prop, 1976
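The update rule above, Δw = −c ∂L/∂w applied one sample at a time with the gradient obtained by back-propagation, can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not code from the talk; the architecture (one tanh hidden layer), the sin target, and all hyperparameters are arbitrary choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn y = sin(x) with a one-hidden-layer analog-neuron MLP.
X = rng.uniform(-np.pi, np.pi, size=(200, 1))
Y = np.sin(X)

n_hidden = 32
W1 = rng.normal(0.0, 1.0, size=(1, n_hidden))
b1 = np.zeros(n_hidden)
w2 = rng.normal(0.0, 1.0 / np.sqrt(n_hidden), size=(n_hidden, 1))

def forward(x):
    u = x @ W1 + b1          # pre-activations
    h = np.tanh(u)           # differentiable (analog) neuron
    return h, h @ w2

eta = 0.05                   # learning constant c
for step in range(10_000):
    i = rng.integers(len(X))             # one sample at a time: stochastic gradient
    x, y = X[i:i+1], Y[i:i+1]
    h, z = forward(x)
    e = z - y                            # dL/dz for L = |y - f(x, W)|^2 / 2
    # error back-propagation: chain rule through the layers
    grad_w2 = h.T @ e
    delta1 = (e @ w2.T) * (1 - h**2)     # tanh'(u) = 1 - tanh(u)^2
    grad_W1 = x.T @ delta1
    w2 -= eta * grad_w2                  # Delta w = -c dL/dw
    W1 -= eta * grad_W1
    b1 -= eta * delta1[0]

_, Z = forward(X)
mse = float(np.mean((Z - Y) ** 2))
print(mse)
```

The per-sample updates are noisy, but on average they follow the gradient of the expected loss, which is the point of the 1967 stochastic descent argument.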

SLIDE 5

Information Theory II: Geometrical Theory of Information

Shun-ichi Amari, University of Tokyo

Kyoritsu Press, Tokyo, 1968

First stochastic descent learning of MLP (1967; 1968)

SLIDE 6

f(x, θ) = v₁ max{w₁·x, w₂·x} + v₂ min{w₃·x, w₄·x}

(figure: two-layer network; input x, weights w₁…w₄ feeding max/min units, output weights v₁, v₂, output y)

SLIDE 7
SLIDE 8

Second Boom

AI (1970~): expert systems (MYCIN), stochastic inference (Bayes), chess (1997)

Neural networks (1980~): MLP (backprop), associative memory

SLIDE 9

Third Boom (2010~)

Stochastic inference (graphical models; Bayesian; WATSON); deep learning

Deep learning: pattern recognition (vision, auditory), sentence analysis, machine translation, alpha-go

Language processing; sequence and dynamics (word2vec, deep learning with recurrent nets)

Integration of (symbol, logic) vs (pattern, dynamics)

SLIDE 10

Deep Learning

Self-Organization + Supervised Learning

RBM (Restricted Boltzmann Machine), auto-encoder, recurrent net

Dropout, contrastive divergence

Convolution, ResNet, ReLU, adversarial net

SLIDE 11

Victory of Deep Neural Networks

Hinton 2005, 2006; ~2012 many others

visual patterns, auditory patterns, the game of Go, sentence analysis, machine translation, adversarial networks, pattern generation

SLIDE 12

Mathematical neuroscience searches for the principles: mathematical studies using simple, idealized models (not realistic). Computational neuroscience. AI: technological realization.

SLIDE 13

Mathematical Neuroscience and Brain

The brain has found and implemented the principles through evolution (random search), under historical and material restrictions.

Very complex (not smartly designed)

SLIDE 14

Theoretical Problems on Learning: 1

Local solutions and global solutions

Simulated annealing; quantum annealing

(figure: loss landscape L(Θ) over parameter space Θ, with local and global minima)

SLIDE 15

Theoretical Problems of Learning: 2

Training loss and generalization loss: overtraining

y = f(x, θ) + ε

L_emp = (1/N) Σᵢ |yᵢ − f(xᵢ, θ)|²  (training loss)

L_gen = E[|y − f(x, θ)|²]  (generalization loss)

L_gen ≈ L_emp + P/N

SLIDE 16

Extremely wide network, P → ∞ (P >> N): local minimum = global minimum

Kawaguchi, 2019

SLIDE 17

Learning curve for P >> N: double descent

Belkin et al. 2019; Hastie et al. 2019
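Double descent can be reproduced with a minimal random-features experiment: fit N noisy samples with P random ReLU features by minimum-norm least squares and sweep P through the interpolation threshold P = N. This sketch and all of its parameters (dimension d, sample size N, noise level, the choice of ReLU features) are assumptions made here for illustration; it is not the exact setup of Belkin et al. or Hastie et al.

```python
import numpy as np

rng = np.random.default_rng(1)

# Teacher: linear function of x in d dimensions, observed with noise.
d, N, N_test = 40, 20, 500
beta = rng.normal(size=d) / np.sqrt(d)

def sample(n):
    X = rng.normal(size=(n, d))
    return X, X @ beta + 0.1 * rng.normal(size=n)

Xtr, ytr = sample(N)
Xte, yte = sample(N_test)

def errors(P):
    # P random ReLU features; minimum-norm least-squares fit via pinv
    F = rng.normal(size=(d, P)) / np.sqrt(d)
    Htr, Hte = np.maximum(Xtr @ F, 0), np.maximum(Xte @ F, 0)
    w = np.linalg.pinv(Htr) @ ytr
    return (float(np.mean((Htr @ w - ytr) ** 2)),
            float(np.mean((Hte @ w - yte) ** 2)))

results = {P: errors(P) for P in [5, 10, 20, 40, 200, 1000]}
for P, (tr, te) in results.items():
    print(f"P={P:5d}  train={tr:.6f}  test={te:.3f}")
```

Once P ≥ N the training error is essentially zero (the network interpolates), while the test error typically peaks near P = N and descends again as P grows, which is the double-descent shape of the learning curve.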

SLIDE 18

Random Neural Network

Random is excellent!! Random is magic!!

Statistical dynamics; random codes

SLIDE 19

Random Deep Networks

Poole et al., 2016; Schoenholz et al., 2017; …

Signal propagation; error back-propagation

SLIDE 20

Jacot et al.: Neural Tangent Kernel

y = f(x, θ),  l(x, y) = ½ {y − f(x, θ)}²

e(x) = f(x, θ) − f(x, θ*)

θ̇ = −η ∂_θ l = −η e(x) ∂_θ f(x, θ)

ḟ(x, θ) = ∂_θ f(x, θ) · θ̇ = −η {∂_θ f(x, θ) · ∂_θ f(x', θ)} e(x')

K: Gaussian kernel

SLIDE 21

K(x, x'; θ) = ∂_θ f(x, θ) · ∂_θ f(x', θ)

ḟ(x, θ) = −η ⟨K(x, x'; θ) e(x')⟩

K(x, x'; θ_t) ≈ K(x, x'; θ_init) ≈ K(x, x'): Gaussian kernel
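The kernel K(x, x'; θ) = ∂_θ f(x, θ) · ∂_θ f(x', θ) can be estimated empirically for a toy network by finite-difference gradients; the resulting Gram matrix is symmetric and positive semi-definite by construction. The network shape and the sample points below are arbitrary illustrative choices, not the infinite-width setting of Jacot et al.

```python
import numpy as np

rng = np.random.default_rng(2)

# Small scalar-output network f(x, theta); theta packs both layers.
n_in, n_hid = 3, 50
theta0 = np.concatenate([
    rng.normal(0, 1 / np.sqrt(n_in), n_in * n_hid),   # input weights
    rng.normal(0, 1 / np.sqrt(n_hid), n_hid),         # output weights
])

def f(x, theta):
    W = theta[: n_in * n_hid].reshape(n_in, n_hid)
    v = theta[n_in * n_hid:]
    return np.tanh(x @ W) @ v

def grad_theta(x, theta, eps=1e-5):
    # central finite differences: d f(x, theta) / d theta
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        t1, t2 = theta.copy(), theta.copy()
        t1[i] += eps
        t2[i] -= eps
        g[i] = (f(x, t1) - f(x, t2)) / (2 * eps)
    return g

xs = rng.normal(size=(4, n_in))
J = np.stack([grad_theta(x, theta0) for x in xs])
K = J @ J.T        # empirical NTK: K[i, j] = grad_f(x_i) . grad_f(x_j)
print(np.round(K, 3))
```

In the wide-network limit this kernel stays (approximately) frozen at its initialization value during training, which is why gradient descent on f behaves like kernel regression.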

SLIDE 22

Theorem (P >> N): the optimal solution lies near a random network.

Bailey et al., 2019

w_ij = O(1/√n) (random initialization),  Δw_ij = O(1/n)

SLIDE 23

Random Neural Field

u^{l+1}(z') = ∫ w(z, z') x^l(z) dz + b(z'),  x^{l+1}(z') = φ(u^{l+1}(z'))

w(z, z'): random (0-mean, Gaussian, correlated)

SLIDE 24

Statistical Neurodynamics

microdynamics: x_{t+1} = T_W(x_t) = sgn(W x_t)

macrodynamics: X_{t+1} = F(X_t),  X: macrostate

X₂ = X(T_W x₁),  X₃ = X(T_W T_W x₁) = F(X₂)?

SLIDE 25

Statistical Neurodynamics

Rozonoer (1969); Amari (1969, 1971, 1973)

Sompolinsky; Amari et al. (2013); Toyoizumi et al. (2015); Poole, …, Ganguli (2016); Schoenholz et al. (2017); Yang & Schoenholz (2017); Karakida et al. (2019); Jacot et al. (2019); ……

w_ij ~ N(0, 1)

Macroscopic behaviors common to almost all (typical) networks

SLIDE 26

Random Deep Networks

x_i^{l+1} = φ(Σ_j w_ij^l x_j^l + b_i^l)

A^l = (1/n) Σ_i (x_i^l)²,  A^{l+1} = F(A^l)

w_ij^l ~ N(0, σ_w²/n),  b_i^l ~ N(0, σ_b²)

SLIDE 27

Macroscopic variables

activity: A^l = (1/n) Σ_i (x_i^l)²

distance: D = D[x : x'],  A^{l+1} = F(A^l),  D^{l+1} = K(D^l)

metric, curvature & Fisher information

SLIDE 28

Dynamics of Activity: law of large numbers

x̄_i = φ(Σ_k w_ik x_k + b_i) = φ(u_i),  u = Wx + b,  u_i ~ N(0, σ_w² A + σ_b²)

Ā = (1/n) Σ_i φ(u_i)² → E[φ(u)²] = χ(A)

χ(A) = ∫ φ(√(σ_w² A + σ_b²) v)² Dv,  v ~ N(0, 1)

SLIDE 29

χ'(0) > 1:  A^l = (1/n) Σ_i (x_i^l)² → A* (converges to a fixed point)
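The convergence of the activity recursion can be checked numerically: iterate the mean-field map A^{l+1} = F(A^l) by Monte Carlo and compare it with one realization of a wide random network. A sketch assuming tanh units and the w_ij ~ N(0, σ_w²/n), b_i ~ N(0, σ_b²) statistics of the slides; the width, depth and σ values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

n, L = 2000, 8
sigma_w, sigma_b = 1.5, 0.3
phi = np.tanh

def F(A, n_mc=200_000):
    # mean-field map: E[phi(u)^2] with u ~ N(0, sigma_w^2 A + sigma_b^2)
    v = rng.normal(size=n_mc)
    return float(np.mean(phi(np.sqrt(sigma_w**2 * A + sigma_b**2) * v) ** 2))

# One wide random network realization, layer by layer.
x = rng.normal(size=n)
A_sim = [float(np.mean(x**2))]
for _ in range(L):
    W = rng.normal(0.0, sigma_w / np.sqrt(n), size=(n, n))
    b = rng.normal(0.0, sigma_b, size=n)
    x = phi(W @ x + b)
    A_sim.append(float(np.mean(x**2)))

# Mean-field prediction started from the same initial activity.
A_mf = [A_sim[0]]
for _ in range(L):
    A_mf.append(F(A_mf[-1]))

err = max(abs(s - m) for s, m in zip(A_sim, A_mf))
print(A_sim[-1], A_mf[-1], err)
```

At this width the layer-wise activity of a single network tracks the deterministic macroscopic recursion to within O(1/√n) fluctuations, and both settle at the fixed point A*.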

SLIDE 30

Pullback Metric & Curvature

ds² = g_ij dx^i dx^j = (1/n) dx̄ · dx̄,  x̄ = φ(Wx)

SLIDE 31

Basis vectors

dx_i^l = Σ_{i'} B_{i i'}^l dx_{i'}^{l-1},  B_{i i'}^l = φ'(u_i^l) W_{i i'}^l  (Jacobian)

dx^l = B^l dx^{l-1} = B^l B^{l-1} ⋯ B^1 dx,  e_a^l = B^l e_a^{l-1}

x̄ = φ(Wx)

SLIDE 32

g_ab = (1/n) e_a · e_b

SLIDE 33

Dynamics of Metric

ē_a^i = φ'(u_i) Σ_k w_ik e_a^k

ḡ_ab = (1/n) ē_a · ē_b = σ_w² E[φ'(u)²] g_ab  (mean field approximation)

χ₁(A) = σ_w² ∫ φ'(√(σ_w² A + σ_b²) v)² Dv,  v ~ N(0, 1)

SLIDE 34

Metric

g_ab^l = (1/n) e_a^l · e_b^l,  ds² = g_ab^l dx^a dx^b

E[Σ_i φ'(u_i^l)² w_{i i'}^l w_{i i''}^l] ≈ σ_w² E[φ'(u^l)²] δ_{i' i''}  (law of large numbers)

g_ab^l = χ₁(A^{l-1}) g_ab^{l-1}

SLIDE 35

g_ij^l(x) = ∏_{l'} χ₁(A^{l'}(x)) g_ij(x)

conformal geometry

SLIDE 36

conformal transformation!

ḡ_ij(x) = χ₁(A) g_ij(x);  g_ij = δ_ij ⇒ g_ij^l = ∏ χ₁(A^{l'}) δ_ij

rotation, expansion

SLIDE 37

Domino Theorem

∂x^L/∂x^{l-1} = B^L B^{L-1} ⋯ B^l

E[(B^L ⋯ B^l)ᵀ (B^L ⋯ B^l)] ≈ (∏_{m=l}^{L} χ₁(A^m)) I

the Jacobian factors contribute their χ₁ one after another, like falling dominoes

SLIDE 38

Dynamics of Curvature

H_ab = ∇_a e_b = ∂_a ∂_b x̄

H̄_ab^i = φ''(u_i)(w_i · e_a)(w_i · e_b) + φ'(u_i) w_i · H_ab

H̄_ab = H̄*_ab + H̄**_ab  (newly created term + pullback term),  |H̄_ab|² = |H̄*_ab|² + |H̄**_ab|²

SLIDE 39

χ₂(A) = ∫ φ''(√(σ_w² A + σ_b²) v)² Dv

|H^{l+1}|² ≈ χ₁ |H^l|² + (creation term ∝ χ₂, O(1/n) per layer)

χ₁ > 1: exponential expansion! creation is small!

SLIDE 40

Poole et al (2016) Deep neural networks

SLIDE 41

Distance

D[x, y] = (1/n) Σ_i (x_i − y_i)²

SLIDE 42

Dynamics of Distance (Amari, 1974)

D(x, x') = (1/n) Σ_i (x_i − x_i')²,  C(x, x') = (1/n) Σ_i x_i x_i',  D = A + A' − 2C

u_i = Σ_k w_ik x_k,  u_i' = Σ_k w_ik x_k',  (u, u') ~ N(0, V),  V = ((A, C), (C, A'))

C̄ = E[φ(u) φ(u')]

SLIDE 43

D^{l+1} = K(D^l),  dD^{l+1}/dD^l = χ₁ > 1

SLIDE 44

Problem!  D^l(x, x') → D* (l → ∞):  D* = K(D*)  equi-distance property

SLIDE 45

dynamics of distance

lim_{L→∞} D_n^L(x, y):  lim_{L→∞} lim_{n→∞} D_n^L(x, y)  vs  lim_{n→∞} lim_{L→∞} D_n^L(x, y)

(the order of the limits n → ∞ and L → ∞ matters)

SLIDE 46

Feedback Path

Error backprop; Fisher information

SLIDE 47

Stochastic model: parameter space as a manifold of probability distributions

y = φ(W_L φ(W_{L-1} ⋯ φ(W_1 x) ⋯)) + ε,  ε ~ N(0, 1)

p(y, x; W) = c exp{−½ (y − φ(x; W))²} q(x)

G = E[∇_W log p(y, x; W) ∇_W log p(y, x; W)ᵀ],  ds² = dW G dWᵀ

SLIDE 48

Learning: stochastic gradient descent; steepest direction: natural gradient

∇l = (∂l/∂θ₁, …, ∂l/∂θ_n)

∇̃l = G⁻¹ ∇l

Δθ_t = −η_t ∇̃l(x_t, y_t; θ_t)

SLIDE 49

Natural Gradient

max dl = l(θ + dθ) − l(θ)  subject to  KL[p(x, θ) : p(x, θ + dθ)] = ε²

⇒ ∇̃l = G⁻¹ ∇l

Δθ_t = −η_t ∇̃l(x_t, y_t; θ_t)
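The update Δθ = −η G⁻¹ ∇l can be demonstrated on a model where G is known in closed form: a Gaussian N(μ, σ²) parameterized by θ = (μ, log σ), whose Fisher information matrix is diag(1/σ², 2). Everything below (data, initialization, step size) is an illustrative assumption made here, not the MLP setting of the slides.

```python
import numpy as np

rng = np.random.default_rng(5)

# Samples from the target distribution N(2.0, 0.5^2).
data = rng.normal(2.0, 0.5, size=10_000)

def nll_grad(mu, log_s):
    # Gradient of the mean negative log-likelihood of N(mu, sigma^2)
    # in the coordinates theta = (mu, log sigma).
    s2 = np.exp(2 * log_s)
    d = data - mu
    return np.array([-np.mean(d) / s2, 1 - np.mean(d**2) / s2])

def fisher(log_s):
    # Fisher information of N(mu, sigma^2) in (mu, log sigma): diag(1/s2, 2).
    return np.diag([1 / np.exp(2 * log_s), 2.0])

def train(natural, eta=0.1, steps=300):
    theta = np.array([-3.0, np.log(5.0)])   # deliberately far-off start
    for _ in range(steps):
        g = nll_grad(*theta)
        if natural:
            g = np.linalg.solve(fisher(theta[1]), g)   # G^{-1} grad l
        theta -= eta * g
    return theta

th_nat = train(natural=True)
th_ord = train(natural=False)
print("natural :", th_nat[0], float(np.exp(th_nat[1])))
print("ordinary:", th_ord[0], float(np.exp(th_ord[1])))
```

The natural-gradient step is invariant to the parameterization: the μ update proceeds at a rate independent of the current σ, whereas the plain gradient's effective step size for μ is scaled by 1/σ² and so depends on how badly σ is currently estimated.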

SLIDE 50

Information Geometry of MLP

Natural gradient learning: S. Amari; H. Y. Park

Δθ_t = −η_t Ĝ_t⁻¹ ∂l/∂θ

Ĝ_{t+1}⁻¹ = (1 + ε_t) Ĝ_t⁻¹ − ε_t Ĝ_t⁻¹ ∇f (∇f)ᵀ Ĝ_t⁻¹  (adaptive estimation of G⁻¹)

SLIDE 51

Fisher Information

G = E[(∂φ/∂w)(∂φ/∂w')],  with ∂x^L/∂x^{l-1} = B^L B^{L-1} ⋯ B^l

block structure of G: same-unit blocks are O(1); blocks across layers (l ≠ m) are O(1/√n); blocks across units (i ≠ j) are O(1/n)

SLIDE 52

Unitwise natural gradient

ΔW = −η G_W⁻¹ ∇l

Y. Ollivier; Marceau-Caron
SLIDE 53

Good news and bad news

G*: unitwise-diagonal matrix

good news: G → G* (n → ∞)

bad news: G⁻¹ → (G*)⁻¹ does not follow (n → ∞)

SLIDE 54

Karakida theory

eigenvalues λ_i of G: the mean is O(1/n), while the largest eigenvalues are O(1): most eigenvalues are close to zero, a few are huge

distorted Riemannian metric: ds² = dθ G dθ

SLIDE 55

References:

Poole, …, Ganguli (2016); Schoenholz et al. (2017); Yang & Schoenholz (2017); ……

  • S. Amari, R. Karakida & M. Oizumi, Statistical neurodynamics of deep networks: Geometry of signal spaces. arXiv:1808.07169, 2018.

  • S. Amari, R. Karakida & M. Oizumi, Fisher information and natural gradient learning of random deep networks. arXiv:1808.07172, 2018 (AISTATS-19).

  • R. Karakida, S. Akaho & S. Amari, Universal statistics of Fisher information in deep neural networks: Mean field approach. arXiv:1806.01316, 2018 (AISTATS-19).