Anima Anandkumar
ROLE OF TENSORS IN ML
TRINITY OF AI/ML: DATA + COMPUTE + ALGORITHMS
EXAMPLE AI TASK: IMAGE CLASSIFICATION
[Image with detected labels: Maple, Tree, Villa, Backyard, Plant, Potted Plant, Garden, Swimming Pool, Water]
DATA: LABELED IMAGES FOR TRAINING AI
Picture credits: Image-net.org, ZDnet.com
- 14 million images and 1,000 categories.
- Largest database of labeled images.
- Images in the Fish category.
- Captures variations of fish.
MODEL: CONVOLUTIONAL NEURAL NETWORK
Example outputs: p(cat) = .02, p(dog) = .85

- Deep learning: many layers give the model large capacity to learn from data.
- Inductive bias: prior knowledge about natural images.
DEEP LEARNING: LAYERS OF PROCESSING
Picture credits: Zeiler et al.
MOORE’S LAW: A SUPERCHARGED LAW

- More than a billion operations per image.
- NVIDIA GPUs enable parallel operations.
- Enables large-scale AI.
COMPUTE INFRASTRUCTURE FOR AI: GPU
RISE OF GPU COMPUTING

[Chart, 1980–2020, log scale: GPU-computing performance grows 1.5X per year (1000X by 2025), while single-threaded performance slowed from 1.5X per year to 1.1X per year. Stack: applications, systems, algorithms, CUDA architecture.]

Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data for 2010–2015 collected by K. Rupp.
HOW GPU ACCELERATION WORKS
GPUs and CPUs work together: the compute-intensive functions (about 5% of the code) are parallelized on the GPU, while the rest of the sequential code runs on the CPU.

GPU: many thousands of smaller cores (>5,000); throughput-optimized design targeting maximum throughput of many threads.

CPU: only a few “fat” cores (8–20 typical); latency-oriented design targeting minimal latency of a single thread.
PROGRESS IN TRAINING IMAGENET
Statista: Statistics Portal
[Chart, 2010–2015: error in making 5 guesses about the image category falls each year, dropping below the human level in 2015.]
Need the Trinity of AI: Data + Algorithms + Compute
TENSORS PLAY A CENTRAL ROLE: DATA + COMPUTE + ALGORITHMS
TENSOR: EXTENSION OF MATRIX
TENSORS FOR DATA ENCODE MULTI-DIMENSIONALITY
Image: 3 dimensions (width × height × channels). Video: 4 dimensions (width × height × channels × time).
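These shapes map directly onto array axes; a minimal NumPy sketch (the sizes are illustrative, not from the slide):

```python
import numpy as np

# An RGB image is a 3rd-order tensor: width x height x channels
image = np.zeros((224, 224, 3))

# A video adds a time axis, giving a 4th-order tensor
video = np.zeros((224, 224, 3, 30))

print(image.ndim, video.ndim)  # 3 4
```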
TENSORS FOR ML ALGORITHMS ENCODE HIGHER ORDER MOMENTS

Pairwise correlations: E(x ⊗ x)_{i,j} = E(x_i x_j)

Third-order correlations: E(x ⊗ x ⊗ x)_{i,j,k} = E(x_i x_j x_k)
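These moment tensors can be estimated from samples with a single `einsum` each; a NumPy sketch (sample count and dimension are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((10000, 5))  # 10,000 samples of a 5-dimensional x

# Pairwise correlations: a matrix, M2[i, j] = E(x_i x_j)
M2 = np.einsum('ni,nj->ij', x, x) / len(x)

# Third-order correlations: a 3rd-order tensor, M3[i, j, k] = E(x_i x_j x_k)
M3 = np.einsum('ni,nj,nk->ijk', x, x, x) / len(x)

print(M2.shape, M3.shape)  # (5, 5) (5, 5, 5)
```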
TENSORS FOR MODELS: STANDARD CNNs USE LINEAR ALGEBRA
Jean Kossaifi, Zack Chase Lipton, Aran Khanna, Tommaso Furlanello, Anima Anandkumar. Jupyter notebooks: https://github.com/JeanKossaifi/tensorly-notebooks
TENSORS FOR MODELS TENSORIZED NEURAL NETWORKS
SPACE SAVING IN DEEP TENSORIZED NETWORKS
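The saving comes from replacing a dense weight tensor with a small core plus one factor per mode. A back-of-the-envelope count in NumPy (the layer shape and Tucker ranks are illustrative, not from the slide):

```python
import numpy as np

# Dense 4th-order convolution weight tensor: out x in x kh x kw
shape = (64, 64, 3, 3)
full_params = int(np.prod(shape))            # 36,864 parameters

# Tucker form: a small core plus one factor matrix per mode
ranks = (16, 16, 3, 3)
core_params = int(np.prod(ranks))            # 2,304
factor_params = sum(s * r for s, r in zip(shape, ranks))  # 2,066
tucker_params = core_params + factor_params  # 4,370

print(full_params / tucker_params)  # roughly 8x fewer parameters
```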
TENSORS FOR LONG-TERM FORECASTING
Difficulties in long-term forecasting:
- Long-term dependencies
- High-order correlations
- Error propagation
RNNS: FIRST-ORDER MARKOV MODELS
Input state y_t, hidden state h_t, output z_t: h_t = g(y_t, h_{t−1}; θ); z_t = f(h_t; θ)
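A minimal NumPy sketch of this first-order recurrence (the dimensions, and the tanh/affine choices for g and f, are illustrative):

```python
import numpy as np

def rnn_step(y_t, h_prev, W, U, V, b, c):
    """One step: h_t = g(y_t, h_{t-1}; theta); z_t = f(h_t; theta)."""
    h_t = np.tanh(W @ y_t + U @ h_prev + b)  # hidden-state update
    z_t = V @ h_t + c                        # output
    return h_t, z_t

rng = np.random.default_rng(0)
d_in, d_h, d_out = 3, 4, 2
W = rng.standard_normal((d_h, d_in))
U = rng.standard_normal((d_h, d_h))
V = rng.standard_normal((d_out, d_h))
b, c = np.zeros(d_h), np.zeros(d_out)

h = np.zeros(d_h)                         # initial hidden state
for y in rng.standard_normal((5, d_in)):  # unroll over a length-5 sequence
    h, z = rnn_step(y, h, W, U, V, b, c)
```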
TENSOR-TRAIN RNNS AND LSTMS
Seq2seq architecture TT-LSTM cells
[Results on a climate dataset and a traffic dataset]
TENSOR LSTM FOR LONG-TERM FORECASTING
UNSUPERVISED LEARNING: TOPIC MODELS THROUGH TENSORS
[Topics: Justice, Education, Sports]
TENSORS FOR MODELING: TOPIC DETECTION IN TEXT
Co-occurrence of word triplets decomposes into topics (Topic 1, Topic 2, …).
TENSOR-BASED TOPIC MODELING IS FASTER
- Mallet is an open-source framework for topic modeling.
- Benchmarks on the AWS SageMaker platform.
- Built into the AWS Comprehend NLP service.
[Charts: training time in minutes vs. number of topics, spectral method vs. Mallet.
NYTimes (300,000 documents): spectral method is 22x faster on average.
PubMed (8 million documents): spectral method is 12x faster on average.]
TENSORLY: HIGH-LEVEL API FOR TENSOR ALGEBRA
- Python programming
- User-friendly API
- Multiple backends: flexible + scalable
- Example notebooks in repository
TENSORLY WITH PYTORCH BACKEND
import torch
from torch.autograd import Variable
import tensorly as tl
from tensorly.random import tucker_tensor
from tensorly import tucker_to_tensor

tl.set_backend('pytorch')                    # Set PyTorch backend

# Random Tucker tensor of shape (5, 5, 5) with core of rank (3, 3, 3)
core, factors = tucker_tensor((5, 5, 5), rank=(3, 3, 3))

# Attach gradients
core = Variable(core, requires_grad=True)
factors = [Variable(f, requires_grad=True) for f in factors]

# Set optimizer (`tensor`, `lr` and `n_iter` are assumed defined earlier)
optimiser = torch.optim.Adam([core] + factors, lr=lr)

for i in range(1, n_iter):
    optimiser.zero_grad()
    rec = tucker_to_tensor(core, factors)    # Reconstruct from Tucker form
    loss = (rec - tensor).pow(2).sum()       # Squared reconstruction error
    for f in factors:
        loss = loss + 0.01 * f.pow(2).sum()  # L2 penalty on the factors
    loss.backward()
    optimiser.step()
TENSORS FOR COMPUTE: TENSOR CONTRACTION PRIMITIVE

Extends the notion of matrix product.

Matrix product: Mv = Σ_j v_j M_j

Tensor contraction: T(u, v, ·) = Σ_{i,j} u_i v_j T_{i,j,:}
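Both operations are a single `einsum` call in NumPy; a sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 3))     # matrix
T = rng.standard_normal((3, 3, 4))  # 3rd-order tensor
u, v = rng.standard_normal(3), rng.standard_normal(3)

# Matrix product: Mv = sum_j v_j M[:, j]
Mv = np.einsum('ij,j->i', M, v)

# Tensor contraction: T(u, v, .) = sum_{i,j} u_i v_j T[i, j, :]
Tuv = np.einsum('i,j,ijk->k', u, v, T)

assert np.allclose(Mv, M @ v)  # agrees with the ordinary matrix-vector product
```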
TENSOR PRIMITIVES?
- 1969 – BLAS Level 1: Vector-Vector
- 1972 – BLAS Level 2: Matrix-Vector
- 1980 – BLAS Level 3: Matrix-Matrix
- Now? – BLAS Level 4: Tensor-Tensor
History & Future
Better hardware utilization; more complex data accesses.
Kim, Jinsung, et al. "Optimizing Tensor Contractions in CCSD (T) for Efficient Execution on GPUs." (2018).
SPEEDING UP TENSOR CONTRACTIONS
Explicit permutation dominates, especially for small tensors. Consider C_{mnp} = A_{km} B_{pkn}:

1. A_{km} → A_{mk}
2. B_{pkn} → B_{kpn}
3. C_{mnp} → C_{mpn}
4. C_{m(pn)} = A_{mk} B_{k(pn)}  (a single GEMM)
5. C_{mpn} → C_{mnp}
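The five steps can be checked in NumPy: explicit permutations plus one reshaped matrix multiply reproduce the direct contraction (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, N, P = 8, 5, 6, 7
A = rng.standard_normal((K, M))     # A_{km}
B = rng.standard_normal((P, K, N))  # B_{pkn}

# Direct contraction: C_{mnp} = A_{km} B_{pkn}
C = np.einsum('km,pkn->mnp', A, B)

# Same contraction as explicit permutations + a single GEMM (steps 1-5)
A_mk = A.T                                                 # 1: A_{km} -> A_{mk}
B_kpn = B.transpose(1, 0, 2)                               # 2: B_{pkn} -> B_{kpn}
C_mpn = (A_mk @ B_kpn.reshape(K, P * N)).reshape(M, P, N)  # 4: C_{m(pn)} = A_{mk} B_{k(pn)}
C2 = C_mpn.transpose(0, 2, 1)                              # 5: C_{mpn} -> C_{mnp}

assert np.allclose(C, C2)
```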
[Charts (top: CPU, bottom: GPU): fraction of time spent in copies/transpositions as a function of n, with lines for 1, 2, 3, and 6 transpositions.]
NEW PRIMITIVE FOR TENSOR CONTRACTIONS
C[p] = α · op(A[p]) · op(B[p]) + β · C[p]

Pointer-to-pointer BatchedGEMM requires memory allocation and pre-computation. Solution: StridedBatchedGEMM with fixed strides.

- Special case of pointer-to-pointer BatchedGEMM.
- No pointer-to-pointer data structure or overhead.
cublas<T>gemmStridedBatched(
    cublasHandle_t handle,
    cublasOperation_t transA, cublasOperation_t transB,
    int M, int N, int K,
    const T* alpha,
    const T* A, int ldA1, int strideA,
    const T* B, int ldB1, int strideB,
    const T* beta,
    T* C, int ldC1, int strideC,
    int batchCount)
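NumPy's batched `matmul` is the same idea in Python: matrices stored contiguously at a fixed stride from one another, multiplied in one call (a NumPy analogy, not the cuBLAS API itself):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, M, K, N = 4, 5, 6, 7

# Each batch element sits at a fixed stride from the previous one,
# as StridedBatchedGEMM assumes -- no pointer-to-pointer array needed.
A = rng.standard_normal((batch, M, K))
B = rng.standard_normal((batch, K, N))

C = np.matmul(A, B)  # one call performs all `batch` GEMMs

# Equivalent loop of individual GEMMs
C_loop = np.stack([A[i] @ B[i] for i in range(batch)])
assert np.allclose(C, C_loop)
```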
PERFORMANCE OF STRIDEDBATCHEDGEMM
Performance on par with Pure GEMM
REALITIES OF DATA
A FEW OTHER RESEARCH THREADS..
GPU ADOPTION BARRIERS

Yes, GPUs are fast, but…
- Too much data movement
- Too many makeshift data formats
- Writing CUDA C/C++ is hard
- No Python API for data manipulation
DATA MOVEMENT AND TRANSFORMATION
The bane of productivity and performance

[Diagram: App A and App B each read and load data on the CPU, then repeatedly copy & convert it between CPU and GPU, with each app keeping its own copy of the GPU data.]
DATA MOVEMENT AND TRANSFORMATION
What if we could keep data on the GPU?

[Diagram: data is read and loaded once, then shared on the GPU between App A and App B, eliminating the repeated copy & convert steps.]
RE-IMAGINING DATA SCIENCE WORKFLOW
Open Source, End-to-end GPU-accelerated Workflow Built On CUDA
- cuDF: data preparation / wrangling
- cuML: optimized ML model training
- Data visualization libraries: data insights
Built on shared GPU memory: cuDF (analytics / data preparation), cuML (machine learning / model training), cuGraph (graph analytics), PyTorch & Chainer (deep learning), Kepler.GL (visualization).
NVIDIA RAPIDS OPEN SOURCE SOFTWARE
ML Pipelines with both classical ML and Deep Learning
RAPIDS AI LIBRARIES
8x V100: 20–90x faster than a dual-socket CPU
Machine Learning: Decision Trees, Random Forests, Linear Regression, Logistic Regression, K-Means, K-Nearest Neighbors, DBSCAN, Kalman Filtering, Principal Component Analysis, Singular Value Decomposition, Bayesian Inference

Graph Analytics: PageRank, BFS, Jaccard Similarity, Single-Source Shortest Path, Triangle Counting, Louvain Modularity

Time Series: ARIMA, Holt-Winters
XGBoost on the Mortgage dataset: 90x speedup, from 3 hours to 2 minutes on DGX-1
cuML & cuGraph
DATA PROCESSING EVOLUTION
Faster data access, less data movement

Hadoop processing, reading from disk:
HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train

Spark in-memory processing (25–100x improvement; less code; language flexible; primarily in-memory):
HDFS Read → Query → ETL → ML Train

GPU/Spark in-memory processing (5–10x improvement; more code; language rigid; substantially on GPU):
HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train

RAPIDS (50–100x improvement; same code; language flexible; primarily on GPU):
Arrow Read → Query → ETL → ML Train
CUDF + XGBOOST
DGX-2 vs Scale Out CPU Cluster
- Full end-to-end pipeline
- Leveraging Dask + PyGDF
- Store each GPU's results in system memory, then read back in
- Arrow to DMatrix (CSR) for XGBoost
CUDF + XGBOOST
Fully in-GPU benchmarks

- Full end-to-end pipeline
- Leveraging Dask-GDF
- No data-prep time; all in memory
- Arrow to DMatrix (CSR) for XGBoost
- https://ngc.nvidia.com/registry/nvidia-rapidsai-rapidsai
- https://hub.docker.com/r/rapidsai/rapidsai/
- https://github.com/rapidsai
- https://anaconda.org/rapidsai/
- WIP:
  - https://pypi.org/project/cudf
  - https://pypi.org/project/cuml
RAPIDS
How do I get the software?
“Mens sana in corpore sano.” (A sound mind in a sound body.)
Juvenal, Satire X.
A New Vision for Autonomy
Center for Autonomous Systems and Technologies
Explorers
Planetary, Underwater, and Space Explorers
Transformers
Swarms of Robots Transforming Shapes and Functions
Guardians
Dynamic Event Monitors and First Responders
Transporters
Robotic Flying Ambulances and Delivery Drones
Partners
Robots Helping and Entertaining People
MOONSHOTS
CAST @ CALTECH DRONE TESTING LAB
CAST @ CALTECH LEARNING TO LAND
REVOLUTIONIZING MANUFACTURING AND LOGISTICS
NVIDIA ISAAC — WHERE ROBOTS GO TO LEARN
NVIDIA REVOLUTIONIZING TRANSPORTATION
NVIDIA DRIVE: FROM TRAINING TO SAFETY

[Perception outputs: cars, pedestrians, path, lanes, signs, lights]

1. COLLECT & PROCESS DATA
2. TRAIN MODELS
3. SIMULATE
4. DRIVE
MIND & BODY: NEXT-GENERATION AI

Instinctive: fine-grained reactive control
Deliberative: making and adapting plans
Behavioral: sense and react to humans
Multi-Agent: acting for the greater good