Monte Carlo Methods and Neural Networks

Noah Gamboa and Alexander Keller (NVIDIA)
Neural Networks

Fully connected layers

■ neurons compute max{0, ∑_j w_j a_{i,j}} (see the sketch below)

[Figure: fully connected network with layers of neurons a_{0,0} … a_{0,n_0−1} through a_{L,0} … a_{L,n_L−1}]
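As a minimal sketch of this computation (toy activations and weights; NumPy assumed):

```python
import numpy as np

# a minimal sketch of the slide's neuron: ReLU of a weighted sum,
# max{0, sum_j w_j * a_j}; the values below are made up
def neuron(activations, weights):
    return max(0.0, float(np.dot(weights, activations)))

a = np.array([0.2, -1.0, 0.5])   # hypothetical incoming activations
w = np.array([0.7, 0.1, -0.3])   # hypothetical weights
print(neuron(a, w))              # 0.0, since the weighted sum is -0.11
```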
Monte Carlo Methods all over Neural Networks

Examples

■ dropout
■ drop connect
■ stochastic binarization
■ stochastic gradient descent
■ fixed pseudo-random matrices for direct feedback alignment
■ ...
Monte Carlo Methods all over Neural Networks

Observations

■ the brain
  – about 10^11 nerve cells with up to 10^4 connections to others
  – much more energy efficient than a GPU
■ artificial neural networks
  – rigid layer structure
  – expensive to scale in depth
  – fully connected layers are only partially trained
■ goal: explore algorithms linear in time and space
Partition instead of Dropout

Guaranteeing coverage of neural units

■ so far: drop a neuron if threshold t > ξ
  – ξ by a linear feedback shift register generator (for example)
■ now: assign each neuron to partition p = ⌊ξ · P⌋ out of P (see the sketch below)
  – fewer random number generator calls
  – all neurons are guaranteed to be considered

LeNet on MNIST     Average over dropout t = 1/2 … 1/9   Average over P = 2 … 9 partitions
Mean accuracy      0.6062                               0.6057
StdDev accuracy    0.0106                               0.009
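A minimal sketch contrasting the two schemes (a NumPy generator stands in for the linear feedback shift register; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)  # stand-in for the LFSR mentioned above

def dropout_mask(n_neurons, t):
    # so far: one uniform xi per neuron, drop the neuron if t > xi
    xi = rng.random(n_neurons)
    return xi >= t                       # True where the neuron is kept

def dropout_partitions(n_neurons, P):
    # now: a single xi per neuron assigns it to partition p = floor(xi * P);
    # the P masks are disjoint and together cover all neurons
    xi = rng.random(n_neurons)
    p = (xi * P).astype(int)
    return [p == k for k in range(P)]    # one boolean mask per partition

masks = dropout_partitions(8, P=3)
assert sum(m.sum() for m in masks) == 8  # every neuron is considered exactly once
```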
Partition instead of Dropout

Training accuracy with LeNet on MNIST

[Plot: training accuracy over epochs 20–140, 2 dropout partitions vs. 1/2 dropout]

[Plot: training accuracy over epochs 20–140, 3 dropout partitions vs. 1/3 dropout]
Simulating Discrete Densities

Stochastic evaluation of scalar product

■ discrete density approximation of the weights

[Figure: absolute weights normalized to a discrete density on the unit interval]

  – remember to flip the sign accordingly
  – transform jittered equidistant samples using the cumulative distribution function of the absolute values of the weights
Simulating Discrete Densities

Stochastic evaluation of scalar product

■ partition of the unit interval by the sums P_k := ∑_{j=1}^{k} |w_j| of the normalized absolute weights,

  0 = P_0 < P_1 < ··· < P_m = 1

  – using a uniform random variable ξ ∈ [0, 1), we

    select neuron i ⇔ P_{i−1} ≤ ξ < P_i, satisfying Prob({P_{i−1} ≤ ξ < P_i}) = |w_i|

  – transform jittered equidistant samples using the cumulative distribution function of the absolute values of the weights (see the sketch below)
■ in fact a derivation of the quantization of weights to {−1, 0, +1}
  – integer weights if a neuron is referenced more than once
  – explains why ternary and binary weights did not work in some articles
  – related to drop connect and dropout, too
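A minimal sketch of this construction, with toy weights and NumPy standing in for the sampler; the scaling comment at the end follows from Prob(select i) = |w_i|:

```python
import numpy as np

def sample_weights(w, n_samples, rng=np.random.default_rng(0)):
    # simulate the discrete density |w_j| / sum_k |w_k| by inverting its
    # cumulative distribution P_k = sum_{j<=k} |w_j| (normalized)
    cdf = np.cumsum(np.abs(w) / np.abs(w).sum())
    # jittered equidistant samples: one uniform jitter per stratum
    xi = (np.arange(n_samples) + rng.random(n_samples)) / n_samples
    # select neuron i with P_{i-1} <= xi < P_i
    idx = np.minimum(np.searchsorted(cdf, xi, side='right'), w.size - 1)
    counts = np.bincount(idx, minlength=w.size)  # integer if referenced twice
    return np.sign(w) * counts                   # flip the sign accordingly

w = np.array([0.5, -0.1, 0.3, -0.05, 0.05])      # toy weights
q = sample_weights(w, n_samples=8)               # ternary/integer surrogate
# an estimate of the scalar product w . a is then
# (np.abs(w).sum() / n_samples) * (q @ a)
```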
Simulating Discrete Densities

Test accuracy for a two-layer ReLU feedforward network on MNIST

■ able to achieve 97% of the accuracy of the full model by sampling the most important 12% of the weights!

[Plot: test accuracy vs. number of samples per neuron (100–600)]
Simulating Discrete Densities

Application to convolutional layers

■ sample from the distribution of each filter (for example, 128×5×5 = 3200 weights)
  – less redundant than fully connected layers
■ LeNet architecture on CIFAR-10, best accuracy is 0.6912
■ able to get 88% of the accuracy of the full model with 50% of the weights sampled (see the sketch below)
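To make the per-filter sampling concrete, here is a self-contained sketch along the lines of the scalar-product sampler above; the 128×5×5 filter shape matches the slide's example, while the random weights and the 50% sample count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical conv filter: 128 input channels, 5x5 kernel -> 3200 weights
filt = rng.normal(size=(128, 5, 5))
w = filt.ravel()                            # one discrete density over 3200 weights

cdf = np.cumsum(np.abs(w) / np.abs(w).sum())
n = w.size // 2                             # 50% of the weights sampled
xi = (np.arange(n) + rng.random(n)) / n     # jittered equidistant samples
idx = np.minimum(np.searchsorted(cdf, xi, side='right'), w.size - 1)

counts = np.bincount(idx, minlength=w.size)
sparse_filt = (np.sign(w) * counts).reshape(filt.shape)  # integer surrogate filter
```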
Simulating Discrete Densities

Test accuracy for LeNet on CIFAR-10

[Plot: test accuracy vs. number of samples per filter (500–3,000)]
Neural Networks linear in Time and Space

Number n of neural units

■ for L fully connected layers,

  n = ∑_{l=1}^{L} n_l, where n_l is the number of neurons in layer l

■ number of weights

  n_w = ∑_{l=1}^{L} n_{l−1} · n_l

■ choose the number of weights per neuron such that n_w is proportional to n (worked example below)
  – for example, a constant number of weights per neuron
Neural Networks linear in Time and Space

Results

[Plot: test accuracy vs. percent of fully connected layers sampled, LeNet on MNIST and LeNet on CIFAR-10]

[Plot: test accuracy for AlexNet on CIFAR-10 vs. percent of fully connected layers sampled]

[Plot: top-1 and top-5 test accuracy for AlexNet on ILSVRC12 vs. percent of fully connected layers sampled]
Neural Networks linear in Time and Space

Sampling paths through networks

■ complexity bounded by the number of paths times the depth
■ strong indication of a relation to Markov chains
■ importance sampling by weights (see the sketch below)
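A minimal sketch of sampling one such path, assuming the next neuron is chosen proportionally to the absolute weight (toy network; an illustration of the idea, not the authors' exact scheme):

```python
import numpy as np

def sample_path(weight_mats, rng=np.random.default_rng(0)):
    # trace one path through the layers, choosing the next neuron with
    # probability proportional to |w| (importance sampling by weights);
    # one path costs O(depth), so total cost is #paths * depth
    path = [int(rng.integers(weight_mats[0].shape[0]))]  # random input unit
    for W in weight_mats:                 # W has shape (n_l, n_{l+1})
        p = np.abs(W[path[-1]])
        p /= p.sum()
        path.append(int(rng.choice(p.size, p=p)))
    return path                           # one neuron index per layer

# hypothetical toy network: 4 inputs -> 3 hidden -> 2 outputs
rng = np.random.default_rng(1)
mats = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]
print(sample_path(mats))
```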
Neural Networks linear in Time and Space

Sampling paths through networks

■ sparse from scratch
  – guaranteed connectivity

[Figure: paths sampled through the network layers a_{0,*} through a_{L,*}]
Neural Networks linear in Time and Space

Test accuracy for a 4-layer feedforward network (784/300/300/10)

[Plot: test accuracy vs. number of per-pixel paths through the network (50–300), for MNIST and Fashion-MNIST]
Monte Carlo Methods and Neural Networks

Summary

■ dropout partitions reduce variance
  – using far fewer random numbers
■ simulating discrete densities explains {−1, 0, +1} and integer weights
  – compression and quantization without retraining
■ neural networks with linear complexity for both inference and training
  – sparse from scratch
  – sampling paths through neural networks instead of drop connect and dropout