Monte Carlo Methods and Neural Networks
Noah Gamboa and Alexander Keller


SLIDE 1

Monte Carlo Methods and Neural Networks

Noah Gamboa and Alexander Keller

SLIDE 2

Neural Networks

Fully connected layers

■ neurons

[Figure: fully connected network with layers of neurons a_{0,0} … a_{0,n_0−1} through a_{L,0} … a_{L,n_L−1}]

NVIDIA Confidential 2

SLIDE 3

Neural Networks

Fully connected layers

■ neurons compute max{0, ∑_j w_j a_{i,j}}, the ReLU of a weighted sum

[Figure: fully connected network with neuron a_{2,1} highlighted]
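As a minimal sketch (assuming NumPy; the layer sizes and weights are illustrative, not from the deck), the neuron computation above amounts to:

```python
import numpy as np

# Each neuron of a fully connected layer computes max{0, sum_j w_j * a_j},
# i.e. a ReLU applied to a weighted sum of the previous layer's activations.
rng = np.random.default_rng(0)

def fc_relu_layer(a_prev, W):
    """One fully connected ReLU layer: returns max(0, W @ a_prev)."""
    return np.maximum(0.0, W @ a_prev)

a0 = rng.standard_normal(4)       # activations of layer 0 (illustrative)
W1 = rng.standard_normal((3, 4))  # weights from layer 0 to layer 1
a1 = fc_relu_layer(a0, W1)        # activations of layer 1, all >= 0
```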

SLIDE 4

Monte Carlo Methods all over Neural Networks

Examples

■ dropout
■ drop connect
■ stochastic binarization
■ stochastic gradient descent
■ fixed pseudo-random matrices for direct feedback alignment
■ ...

SLIDE 5

Monte Carlo Methods all over Neural Networks

Observations

■ the brain
– about 10^11 nerve cells, each with up to 10^4 connections to others
– much more energy efficient than a GPU

SLIDE 6

Monte Carlo Methods all over Neural Networks

Observations

■ the brain
– about 10^11 nerve cells, each with up to 10^4 connections to others
– much more energy efficient than a GPU
■ artificial neural networks
– rigid layer structure
– expensive to scale in depth
– partially trained, fully connected

SLIDE 7

Monte Carlo Methods all over Neural Networks

Observations

■ the brain
– about 10^11 nerve cells, each with up to 10^4 connections to others
– much more energy efficient than a GPU
■ artificial neural networks
– rigid layer structure
– expensive to scale in depth
– partially trained, fully connected
■ goal: explore algorithms linear in time and space

SLIDE 8

Partition instead of Dropout

SLIDE 9

Partition instead of Dropout

Guaranteeing coverage of neural units

■ so far: drop out a neuron if threshold t > ξ
– ξ by a linear feedback shift register generator (for example)

SLIDE 10

Partition instead of Dropout

Guaranteeing coverage of neural units

■ so far: drop out a neuron if threshold t > ξ
– ξ by a linear feedback shift register generator (for example)
■ now: assign each neuron to partition p = ⌊ξ · P⌋ out of P partitions
– fewer random number generator calls
– all neurons guaranteed to be considered
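A minimal sketch of the partition assignment, assuming the scheme as described on the slide (NumPy, illustrative sizes): one uniform sample ξ per neuron yields partition p = ⌊ξ · P⌋, so every neuron lands in exactly one of the P partitions and is guaranteed to be considered.

```python
import numpy as np

rng = np.random.default_rng(1)

def assign_partitions(num_neurons, P, rng):
    """Assign each neuron to one of P partitions via p = floor(xi * P)."""
    xi = rng.random(num_neurons)   # one uniform sample xi in [0, 1) per neuron
    return (xi * P).astype(int)    # partition index in {0, ..., P - 1}

parts = assign_partitions(1000, 4, rng)
# every neuron belongs to exactly one of the P = 4 partitions
```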

SLIDE 11

Partition instead of Dropout

Guaranteeing coverage of neural units

■ so far: drop out a neuron if threshold t > ξ
– ξ by a linear feedback shift register generator (for example)
■ now: assign each neuron to partition p = ⌊ξ · P⌋ out of P partitions
– fewer random number generator calls
– all neurons guaranteed to be considered

LeNet on MNIST     average over t = 1/2 … 1/9 dropout     average over P = 2 … 9 partitions
Mean accuracy      0.6062                                 0.6057
StdDev accuracy    0.0106                                 0.009

SLIDE 12

Partition instead of Dropout

Training accuracy with LeNet on MNIST

[Figure: accuracy over 140 training epochs, comparing 2 dropout partitions with 1/2 dropout]

SLIDE 13

Partition instead of Dropout

Training accuracy with LeNet on MNIST

[Figure: accuracy over 140 training epochs, comparing 3 dropout partitions with 1/3 dropout]

SLIDE 14

Simulating Discrete Densities

SLIDE 15

Simulating Discrete Densities

Stochastic evaluation of scalar product

■ discrete density approximation of the weights

[Figure: weight magnitudes as a discrete density on the unit interval]

SLIDE 17

Simulating Discrete Densities

Stochastic evaluation of scalar product

■ discrete density approximation of the weights

[Figure: weight magnitudes as a discrete density on the unit interval]

– remember to flip the sign accordingly

SLIDE 18

Simulating Discrete Densities

Stochastic evaluation of scalar product

■ discrete density approximation of the weights

[Figure: weight magnitudes as a discrete density on the unit interval]

– remember to flip the sign accordingly
– transform jittered equidistant samples using the cumulative distribution function of the absolute values of the weights

SLIDE 19

Simulating Discrete Densities

Stochastic evaluation of scalar product

■ partition of the unit interval by the sums P_k := ∑_{j=1}^{k} |w_j| of the normalized absolute weights

0 = P_0 < P_1 < ··· < P_m = 1

SLIDE 20

Simulating Discrete Densities

Stochastic evaluation of scalar product

■ partition of the unit interval by the sums P_k := ∑_{j=1}^{k} |w_j| of the normalized absolute weights

0 = P_0 < P_1 < ··· < P_m = 1

– using a uniform random variable ξ ∈ [0, 1) we find

select neuron i ⇔ P_{i−1} ≤ ξ < P_i, satisfying Prob({P_{i−1} ≤ ξ < P_i}) = |w_i|

SLIDE 21

Simulating Discrete Densities

Stochastic evaluation of scalar product

■ partition of the unit interval by the sums P_k := ∑_{j=1}^{k} |w_j| of the normalized absolute weights

0 = P_0 < P_1 < ··· < P_m = 1

– using a uniform random variable ξ ∈ [0, 1) we find

select neuron i ⇔ P_{i−1} ≤ ξ < P_i, satisfying Prob({P_{i−1} ≤ ξ < P_i}) = |w_i|

– transform jittered equidistant samples using the cumulative distribution function of the absolute values of the weights

SLIDE 22

Simulating Discrete Densities

Stochastic evaluation of scalar product

■ partition of the unit interval by the sums P_k := ∑_{j=1}^{k} |w_j| of the normalized absolute weights

0 = P_0 < P_1 < ··· < P_m = 1

– using a uniform random variable ξ ∈ [0, 1) we find

select neuron i ⇔ P_{i−1} ≤ ξ < P_i, satisfying Prob({P_{i−1} ≤ ξ < P_i}) = |w_i|

– transform jittered equidistant samples using the cumulative distribution function of the absolute values of the weights
■ in fact, this derives the quantization of weights to {−1, 0, +1}
– integer weights if a neuron is referenced more than once
– explains why ternary and binary weights did not work in some articles
– related to drop connect and dropout, too

SLIDE 23

Simulating Discrete Densities

Test accuracy for two layer ReLU feedforward network on MNIST

■ able to achieve 97% of the accuracy of the full model by sampling the most important 12% of the weights!

[Figure: test accuracy vs. number of samples per neuron]

SLIDE 24

Simulating Discrete Densities

Application to convolutional layers

■ sample from the distribution of a filter (for example, 128 × 5 × 5 = 3200 weights)
– less redundant than fully connected layers
■ LeNet architecture on CIFAR-10, best accuracy is 0.6912
■ able to get 88% of the accuracy of the full model with 50% of the weights sampled

SLIDE 25

Simulating Discrete Densities

Test accuracy for LeNet on CIFAR-10

[Figure: test accuracy vs. number of samples per filter]

SLIDE 26

Neural Networks linear in Time and Space

SLIDE 27

Neural Networks linear in Time and Space

Number n of neural units

■ for L fully connected layers

n = ∑_{l=1}^{L} n_l, where n_l is the number of neurons in layer l

SLIDE 28

Neural Networks linear in Time and Space

Number n of neural units

■ for L fully connected layers

n = ∑_{l=1}^{L} n_l, where n_l is the number of neurons in layer l

■ number of weights

n_w = ∑_{l=1}^{L} n_{l−1} · n_l

SLIDE 29

Neural Networks linear in Time and Space

Number n of neural units

■ for L fully connected layers

n = ∑_{l=1}^{L} n_l, where n_l is the number of neurons in layer l

■ number of weights

n_w = ∑_{l=1}^{L} n_{l−1} · n_l

■ choose the number of weights per neuron such that n is proportional to n_w
– for example, a constant number of weights per neuron

SLIDE 30

Neural Networks linear in Time and Space

Results

[Figure: test accuracy vs. percentage of fully connected layers sampled, for LeNet on MNIST and LeNet on CIFAR-10]

SLIDE 31

Neural Networks linear in Time and Space

Test accuracy for AlexNet on CIFAR-10

[Figure: test accuracy vs. percentage of fully connected layers sampled]

SLIDE 32

Neural Networks linear in Time and Space

Test accuracy for AlexNet on ILSVRC12

[Figure: top-1 and top-5 test accuracy vs. percentage of fully connected layers sampled]

SLIDE 33

Neural Networks linear in Time and Space

Sampling paths through networks

■ complexity bounded by the number of paths times the depth
■ strong indication of a relation to Markov chains
■ importance sampling by weights
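A minimal sketch of path sampling by importance sampling the weights (assuming NumPy; the weight matrices are random placeholders, not a trained network): starting from an input neuron, the next neuron in each layer is drawn with probability proportional to the absolute connecting weight.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_path(weight_mats, rng):
    """Return one neuron index per layer along a single sampled path."""
    path = [int(rng.integers(weight_mats[0].shape[1]))]  # pick an input neuron
    for W in weight_mats:                                # W has shape (n_l, n_{l-1})
        p = np.abs(W[:, path[-1]])                       # weights leaving current neuron
        p = p / p.sum()                                  # importance sampling by |w|
        path.append(int(rng.choice(W.shape[0], p=p)))
    return path

mats = [rng.standard_normal((5, 4)), rng.standard_normal((3, 5))]
path = sample_path(mats, rng)    # one neuron index in each of the 3 layers
```

Sampling a fixed number of such paths gives a cost bounded by the number of paths times the depth, independent of the layer widths.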

SLIDE 34

Neural Networks linear in Time and Space

Sampling paths through networks

■ sparse from scratch

[Figure: fully connected network with sampled sparse connections]

SLIDE 35

Neural Networks linear in Time and Space

Sampling paths through networks

■ sparse from scratch

[Figure: fully connected network with sampled sparse connections, neuron a_{2,1} highlighted]

SLIDE 36

Neural Networks linear in Time and Space

Sampling paths through networks

■ sparse from scratch

[Figure: network with connections sampled as paths through the layers]

– guaranteed connectivity

SLIDE 40

Neural Networks linear in Time and Space

Test accuracy for a 4-layer feedforward network (784/300/300/10)

[Figure: test accuracy vs. number of per-pixel paths through the network, for MNIST and Fashion-MNIST]

SLIDE 41

Monte Carlo Methods and Neural Networks

SLIDE 42

Monte Carlo Methods and Neural Networks

Summary

■ dropout partitions reduce variance
– using far fewer random numbers

SLIDE 43

Monte Carlo Methods and Neural Networks

Summary

■ dropout partitions reduce variance
– using far fewer random numbers
■ simulating discrete densities explains {−1, 0, +1} and integer weights
– compression and quantization without retraining

SLIDE 44

Monte Carlo Methods and Neural Networks

Summary

■ dropout partitions reduce variance
– using far fewer random numbers
■ simulating discrete densities explains {−1, 0, +1} and integer weights
– compression and quantization without retraining
■ neural networks with linear complexity for both inference and training
– sparse from scratch
– sampling paths through neural networks instead of drop connect and dropout