Monte Carlo Methods and Neural Networks

Noah Gamboa and Alexander Keller (NVIDIA)
Neural Networks

Fully connected layers

■ neurons compute max{0, ∑_j w_j a_{i,j}} (see the sketch below)

[Figure: fully connected network with layers of neurons a_{0,0} … a_{0,n_0−1} through a_{L,0} … a_{L,n_L−1}]
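As a minimal sketch of this computation (toy activations and weights; NumPy assumed):

```python
import numpy as np

# a minimal sketch of the slide's neuron: ReLU of a weighted sum,
# max{0, sum_j w_j * a_j}; the values below are made up
def neuron(activations, weights):
    return max(0.0, float(np.dot(weights, activations)))

a = np.array([0.2, -1.0, 0.5])   # hypothetical incoming activations
w = np.array([0.7, 0.1, -0.3])   # hypothetical weights
print(neuron(a, w))              # 0.0, since the weighted sum is -0.11
```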
Monte Carlo Methods all over Neural Networks

Examples

■ dropout
■ drop connect
■ stochastic binarization
■ stochastic gradient descent
■ fixed pseudo-random matrices for direct feedback alignment
■ ...
Monte Carlo Methods all over Neural Networks

Observations

■ the brain
  – about 10^11 nerve cells with up to 10^4 connections to others
  – much more energy efficient than a GPU
■ artificial neural networks
  – rigid layer structure
  – expensive to scale in depth
  – fully connected layers are only partially trained
■ goal: explore algorithms linear in time and space
Partition instead of Dropout

Guaranteeing coverage of neural units

■ so far: drop a neuron if threshold t > ξ
  – ξ by a linear feedback shift register generator (for example)
■ now: assign each neuron to partition p = ⌊ξ · P⌋ out of P (see the sketch below)
  – fewer random number generator calls
  – all neurons are guaranteed to be considered

LeNet on MNIST     Average over dropout t = 1/2 … 1/9   Average over P = 2 … 9 partitions
Mean accuracy      0.6062                               0.6057
StdDev accuracy    0.0106                               0.009
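A minimal sketch contrasting the two schemes (a NumPy generator stands in for the linear feedback shift register; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)  # stand-in for the LFSR mentioned above

def dropout_mask(n_neurons, t):
    # so far: one uniform xi per neuron, drop the neuron if t > xi
    xi = rng.random(n_neurons)
    return xi >= t                       # True where the neuron is kept

def dropout_partitions(n_neurons, P):
    # now: a single xi per neuron assigns it to partition p = floor(xi * P);
    # the P masks are disjoint and together cover all neurons
    xi = rng.random(n_neurons)
    p = (xi * P).astype(int)
    return [p == k for k in range(P)]    # one boolean mask per partition

masks = dropout_partitions(8, P=3)
assert sum(m.sum() for m in masks) == 8  # every neuron is considered exactly once
```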
Partition instead of Dropout

Training accuracy with LeNet on MNIST

[Plot: training accuracy over epochs 20–140, 2 dropout partitions vs. 1/2 dropout]

[Plot: training accuracy over epochs 20–140, 3 dropout partitions vs. 1/3 dropout]
Simulating Discrete Densities

Stochastic evaluation of scalar product

■ discrete density approximation of the weights

[Figure: absolute weights normalized to a discrete density on the unit interval]

  – remember to flip the sign accordingly
  – transform jittered equidistant samples using the cumulative distribution function of the absolute values of the weights
Simulating Discrete Densities

Stochastic evaluation of scalar product

■ partition of the unit interval by the sums P_k := ∑_{j=1}^{k} |w_j| of the normalized absolute weights,

  0 = P_0 < P_1 < ··· < P_m = 1

  – using a uniform random variable ξ ∈ [0, 1), we

    select neuron i ⇔ P_{i−1} ≤ ξ < P_i, satisfying Prob({P_{i−1} ≤ ξ < P_i}) = |w_i|

  – transform jittered equidistant samples using the cumulative distribution function of the absolute values of the weights (see the sketch below)
■ in fact a derivation of the quantization of weights to {−1, 0, +1}
  – integer weights if a neuron is referenced more than once
  – explains why ternary and binary weights did not work in some articles
  – related to drop connect and dropout, too
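A minimal sketch of this construction, with toy weights and NumPy standing in for the sampler; the scaling comment at the end follows from Prob(select i) = |w_i|:

```python
import numpy as np

def sample_weights(w, n_samples, rng=np.random.default_rng(0)):
    # simulate the discrete density |w_j| / sum_k |w_k| by inverting its
    # cumulative distribution P_k = sum_{j<=k} |w_j| (normalized)
    cdf = np.cumsum(np.abs(w) / np.abs(w).sum())
    # jittered equidistant samples: one uniform jitter per stratum
    xi = (np.arange(n_samples) + rng.random(n_samples)) / n_samples
    # select neuron i with P_{i-1} <= xi < P_i
    idx = np.minimum(np.searchsorted(cdf, xi, side='right'), w.size - 1)
    counts = np.bincount(idx, minlength=w.size)  # integer if referenced twice
    return np.sign(w) * counts                   # flip the sign accordingly

w = np.array([0.5, -0.1, 0.3, -0.05, 0.05])      # toy weights
q = sample_weights(w, n_samples=8)               # ternary/integer surrogate
# an estimate of the scalar product w . a is then
# (np.abs(w).sum() / n_samples) * (q @ a)
```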
Simulating Discrete Densities

Test accuracy for a two-layer ReLU feedforward network on MNIST

■ able to achieve 97% of the accuracy of the full model by sampling the most important 12% of the weights!

[Plot: test accuracy vs. number of samples per neuron (100–600)]
Simulating Discrete Densities

Application to convolutional layers

■ sample from the distribution of each filter (for example, 128×5×5 = 3200 weights)
  – less redundant than fully connected layers
■ LeNet architecture on CIFAR-10, best accuracy is 0.6912
■ able to get 88% of the accuracy of the full model with 50% of the weights sampled (see the sketch below)
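To make the per-filter sampling concrete, here is a self-contained sketch along the lines of the scalar-product sampler above; the 128×5×5 filter shape matches the slide's example, while the random weights and the 50% sample count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical conv filter: 128 input channels, 5x5 kernel -> 3200 weights
filt = rng.normal(size=(128, 5, 5))
w = filt.ravel()                            # one discrete density over 3200 weights

cdf = np.cumsum(np.abs(w) / np.abs(w).sum())
n = w.size // 2                             # 50% of the weights sampled
xi = (np.arange(n) + rng.random(n)) / n     # jittered equidistant samples
idx = np.minimum(np.searchsorted(cdf, xi, side='right'), w.size - 1)

counts = np.bincount(idx, minlength=w.size)
sparse_filt = (np.sign(w) * counts).reshape(filt.shape)  # integer surrogate filter
```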
Simulating Discrete Densities

Test accuracy for LeNet on CIFAR-10

[Plot: test accuracy vs. number of samples per filter (500–3,000)]
Neural Networks linear in Time and Space

Number n of neural units

■ for L fully connected layers,

  n = ∑_{l=1}^{L} n_l, where n_l is the number of neurons in layer l

■ number of weights

  n_w = ∑_{l=1}^{L} n_{l−1} · n_l

■ choose the number of weights per neuron such that n_w is proportional to n (worked example below)
  – for example, a constant number of weights per neuron
Neural Networks linear in Time and Space

Results

[Plot: test accuracy vs. percent of fully connected layers sampled, LeNet on MNIST and LeNet on CIFAR-10]

[Plot: test accuracy for AlexNet on CIFAR-10 vs. percent of fully connected layers sampled]

[Plot: top-1 and top-5 test accuracy for AlexNet on ILSVRC12 vs. percent of fully connected layers sampled]
Neural Networks linear in Time and Space

Sampling paths through networks

■ complexity bounded by the number of paths times the depth
■ strong indication of a relation to Markov chains
■ importance sampling by weights (see the sketch below)
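A minimal sketch of sampling one such path, assuming the next neuron is chosen proportionally to the absolute weight (toy network; an illustration of the idea, not the authors' exact scheme):

```python
import numpy as np

def sample_path(weight_mats, rng=np.random.default_rng(0)):
    # trace one path through the layers, choosing the next neuron with
    # probability proportional to |w| (importance sampling by weights);
    # one path costs O(depth), so total cost is #paths * depth
    path = [int(rng.integers(weight_mats[0].shape[0]))]  # random input unit
    for W in weight_mats:                 # W has shape (n_l, n_{l+1})
        p = np.abs(W[path[-1]])
        p /= p.sum()
        path.append(int(rng.choice(p.size, p=p)))
    return path                           # one neuron index per layer

# hypothetical toy network: 4 inputs -> 3 hidden -> 2 outputs
rng = np.random.default_rng(1)
mats = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]
print(sample_path(mats))
```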
Neural Networks linear in Time and Space

Sampling paths through networks

■ sparse from scratch
  – guaranteed connectivity

[Figure: paths sampled through the network layers a_{0,*} through a_{L,*}]
Neural Networks linear in Time and Space

Test accuracy for a 4-layer feedforward network (784/300/300/10)

[Plot: test accuracy vs. number of per-pixel paths through the network (50–300), for MNIST and Fashion-MNIST]
Monte Carlo Methods and Neural Networks

Summary

■ dropout partitions reduce variance
  – using far fewer random numbers
■ simulating discrete densities explains {−1, 0, +1} and integer weights
  – compression and quantization without retraining
■ neural networks with linear complexity for both inference and training
  – sparse from scratch
  – sampling paths through neural networks instead of drop connect and dropout