

SLIDE 1

PYTORCH AND THE NEW CHALLENGES OF ML

SLIDE 2

LeCun's Law and the Rise of Deep Learning

[Chart: annual citation count of "Gradient-Based Learning Applied to Document Recognition" (LeCun et al., 1998), growing from 20 in 2001 to 2,472 in 2016, 3,711 in 2017, and 5,371 in 2018.]

SLIDE 3

TRANSLATION · SPARK AR · OCULUS VR · BLOOD DONATIONS

SLIDE 4

400T+ PREDICTIONS PER DAY

SLIDE 5

1B+ PHONES RUNNING NEURAL NETS GLOBALLY

SLIDE 6

WHAT IS PYTORCH?

  • SIMPLICITY OVER COMPLEXITY
  • HARDWARE ACCELERATED INFERENCE
  • DISTRIBUTED TRAINING
  • DYNAMIC NEURAL NETWORKS
  • EAGER & GRAPH-BASED EXECUTION

SLIDE 7

BUILT BY THE COMMUNITY · BUILT FOR PRODUCTION · DESIGNED FOR RESEARCHERS

SLIDE 8

BUILT BY THE COMMUNITY · BUILT FOR PRODUCTION · DESIGNED FOR RESEARCHERS

SLIDE 9

~1,200 CONTRIBUTORS
50%+ YOY GROWTH
22K PYTORCH FORUM USERS
SLIDE 10

BUILT BY THE COMMUNITY · BUILT FOR PRODUCTION · DESIGNED FOR RESEARCHERS

SLIDE 11

GROWTH IN ARXIV MENTIONS IN RESEARCH PAPERS

SLIDE 12

16K+ STUDENTS ENROLLED IN COURSES
21M MINUTES OF WATCH TIME IN THE LAST 12 MONTHS (UDACITY, FAST.AI)

  • Practical Deep Learning for Coders, V3
  • Part 2: Deep Learning from the Foundations
  • Introduction to Machine Learning for Coders
  • A Code-First Introduction to Natural Language Processing

SLIDE 13

BUILT BY THE COMMUNITY · BUILT FOR PRODUCTION · DESIGNED FOR RESEARCHERS

SLIDE 14

RESEARCH
PRODUCTION

SLIDES 15-19

PYTORCH
SLIDE 20

CORE PRINCIPLES

BUILDING FOR SCALE · DEVELOPER EFFICIENCY

SLIDE 21

DEVELOPER EFFICIENCY

ENABLING A HIGH VELOCITY OF MODEL ITERATION AND INNOVATION

SLIDE 22

CLEAN APIS

SLIDE 23

NAMED TENSORS (EXPERIMENTAL)

Today, we name and access dimensions by comment:

    # Tensor[N, C, H, W]
    images = torch.randn(32, 3, 56, 56)
    images.sum(dim=1)
    images.select(dim=1, index=0)

But naming explicitly leads to more readable and maintainable code:

    NCHW = ['N', 'C', 'H', 'W']
    images = torch.randn(32, 3, 56, 56, names=NCHW)
    images.sum('C')                  # access dimensions by name
    images.select('C', index=0)
SLIDE 24

TORCHSCRIPT

Models are Python programs; TorchScript programs are an optimizable subset of Python:

  • Same "models are programs" idea
  • Production deployment
  • No Python dependency
  • Compilation for performance and optimization

    import torch
    import torch.nn as nn

    class RNN(nn.Module):
        def __init__(self, W_h, U_h, W_y, b_h, b_y):
            super(RNN, self).__init__()
            self.W_h = nn.Parameter(W_h)
            self.U_h = nn.Parameter(U_h)
            self.W_y = nn.Parameter(W_y)
            self.b_h = nn.Parameter(b_h)
            self.b_y = nn.Parameter(b_y)

        def forward(self, x, h):
            y = []
            for t in range(x.size(0)):
                h = torch.tanh(x[t] @ self.W_h + h @ self.U_h + self.b_h)
                y += [torch.tanh(h @ self.W_y + self.b_y)]
                if t % 10 == 0:
                    print("stats: ", h.mean(), h.var())
            return torch.stack(y), h

    # one annotation!
    script_rnn = torch.jit.script(RNN(W_h, U_h, W_y, b_h, b_y))
SLIDE 25

CORE PRINCIPLES

BUILDING FOR SCALE · DEVELOPER EFFICIENCY

SLIDE 26

BUILDING FOR SCALE

HIGH PERFORMANCE EXECUTION FOR MODEL TRAINING AND INFERENCE

SLIDE 27

GROWTH OF DATA IN ML PIPELINES

FB data used in an ML pipeline: 30% in 2018, 50% today
3X ML data growth in one year

SLIDE 28

SCALE OF ML TRAINING AT FACEBOOK

WORKFLOWS TRAINED: 3X INCREASE
RANKING ENGINEERS: 2X INCREASE
COMPUTE CONSUMED: 3X INCREASE

SLIDE 29

OPTIMIZING FOR HARDWARE BACKENDS

PYTORCH DEVELOPMENT ENV → PYTORCH JIT → MKL-DNN, CUDA/cuDNN, (Q)NNPACK, FBGEMM, XLA, Glow, TVM
SLIDE 30

1. Feature Engineering: Bryce Canyon (70X HDDs + integrated compute), Lightning (30X flash drives, JBOF)
2. Training: Big Basin (8X GPU SXM2 + 2X CPU), Tioga Pass (dual CPU, high mem)
3. Inference: Twin Lakes (single-socket CPU card, low mem), Tioga Pass

SLIDE 31

QUANTIZATION

Efficient inference on server and mobile devices using reduced-precision math.

  • SIMPLICITY OF USE · ACCURACY & PERF CONTROL
  • DYNAMIC QUANTIZATION · POST TRAINING QUANTIZATION · QUANTIZATION AWARE TRAINING

4x LESS MEMORY · 2-4x COMPUTE SPEEDUP
SLIDE 32

PYTORCH = RESEARCH PROTOTYPING + PRODUCTION DEPLOYMENT

SLIDE 33

NAMED TENSORS

SLIDE 34

PyTorch set the bar for ML developer UX by focusing on expressivity and productivity: "I want to write a program, not (manually) build a graph." Where are similar areas for improvement today?

slide-35
SLIDE 35

Data has semantic meaning!

But we force users to drop that context and use an abstract "Tensor" mathematical object

Type to enter a caption.
SLIDE 36

Key Insight: Named Dimensions

Inspired by and done in collaboration with Prof. Alexander Rush, now at Cornell Tech.

SLIDE 37

Key Insight: Named Dimensions

Today we name and access dimensions by comment

SLIDE 38

Key Insight: Named Dimensions

Today we name and access dimensions by comment, but naming explicitly leads to more readable and maintainable code.

SLIDE 39

By retaining semantic meaning, we also avoid common "Tensor Pitfalls"

  • Accidental Broadcasting
  • Accidental Alignment
SLIDE 40

By retaining semantic meaning, we also avoid common "Tensor Pitfalls"

  • Accidental Broadcasting
  • Accidental Alignment
SLIDE 41

Accidental Broadcasting

We didn't expect broadcasting to happen, but it did:

SLIDE 42

Accidental Broadcasting

We didn't expect broadcasting to happen, but it did. We can catch this automatically!

SLIDE 43

Accidental Broadcasting

We didn't expect broadcasting to happen, but it did. We can catch this automatically: broadcast by position, but check that dimension names are aligned.
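A minimal sketch of both behaviors, assuming the named-tensor API shipped as experimental in 1.3 (variable names and shapes are illustrative):

    import torch

    # Unnamed: subtracting a length-3 vector from a (3, 3) matrix
    # silently broadcasts it across rows, intended or not.
    x = torch.randn(3, 3)
    v = torch.randn(3)
    y = x - v  # no error, even if this was an accident

    # Named: broadcasting is still positional, but dimension names
    # must line up, so the accidental case is caught.
    xn = torch.randn(3, 3, names=('N', 'C'))
    per_channel = torch.randn(3, names=('C',))
    ok = xn - per_channel   # fine: 'C' aligns with 'C'

    per_batch = torch.randn(3, names=('N',))
    # xn - per_batch        # RuntimeError: 'N' does not match 'C'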

SLIDE 44

By retaining semantic meaning, we also avoid common "Tensor Pitfalls"

  • Accidental Broadcasting
  • Accidental Alignment
SLIDE 45

Accidental Alignment

No 1->N broadcast occurs: the dimensions are semantically distinct, but their sizes happen to match.

SLIDE 46

Accidental Alignment

No 1->N broadcast occurs: the dimensions are semantically distinct, but their sizes happen to match. But there are so many formats!

SLIDE 47

Accidental Alignment

No 1->N broadcast occurs: the dimensions are semantically distinct, but their sizes happen to match. But there are so many formats! There is a "time bomb" if I ever normalize the wrong format and the "unaligned" dimensions have the same size!

SLIDE 48

Accidental Alignment

No 1->N broadcast occurs: the dimensions are semantically distinct, but their sizes happen to match.

SLIDE 49

Accidental Alignment

No 1->N broadcast occurs: the dimensions are semantically distinct, but their sizes happen to match. If we broadcast by name (align_as), we only need a single normalize function for all formats, as sketched below.
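A minimal sketch of name-based broadcasting with align_as, assuming named inputs (sizes and names are illustrative):

    import torch

    def normalize(x, mean, std):
        # align_as permutes and expands mean/std by dimension name,
        # so one function handles every layout of x
        return (x - mean.align_as(x)) / std.align_as(x)

    mean = torch.randn(3, names=('C',))
    std = torch.rand(3, names=('C',)) + 0.5   # keep strictly positive

    nchw = torch.randn(32, 3, 56, 56, names=('N', 'C', 'H', 'W'))
    nhwc = torch.randn(32, 56, 56, 3, names=('N', 'H', 'W', 'C'))

    out1 = normalize(nchw, mean, std)   # the same normalize function
    out2 = normalize(nhwc, mean, std)   # works for both formats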

SLIDE 50

What about mixing named and unnamed Tensors? I don't want to convert my entire program at once...

SLIDE 51

Coexistence with Unnamed

Named Tensors can coexist with unnamed Tensors. Let's remove the requirement that mean and stdv are named.

SLIDE 52

Coexistence with Unnamed

Named Tensors can coexist with unnamed Tensors. Let's remove the requirement that mean and stdv are named: refine_names lifts unnamed tensors to named tensors, as sketched below.
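A minimal sketch of refine_names, which lifts unnamed dimensions to named ones (the dimension names are illustrative):

    import torch

    x = torch.randn(32, 3, 56, 56)            # fully unnamed
    x = x.refine_names('N', 'C', 'H', 'W')    # now fully named

    # '...' leaves the leading dimensions unnamed, so callers can
    # pass unnamed mean/stdv tensors and name only what they need
    y = torch.randn(32, 3, 56, 56)
    y = y.refine_names(..., 'H', 'W')         # only the last two named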

SLIDE 53

Experimental in 1.3

Named Tensors Core Functionality
  • Common torch operators are supported in eager mode
  • (Unnamed) autograd is supported

Tutorial
  • See our in-depth MultiheadedAttention tutorial

Future Work: Expanded Coverage
  • Expanded NN package coverage
  • Named autograd support
  • Serialization, multiprocessing, distributed, JIT, mypy

SLIDE 54

PyTorch JIT / TorchScript

SLIDE 55

What is the PyTorch JIT?

A compiler and language infrastructure for machine learning.
SLIDE 56

Production Requirements

PORTABILITY: Models should run anywhere
PERFORMANCE: Whole-program optimization
SLIDE 57

Problem Statement

We need a system that can:

1. Capture the structure of PyTorch programs.
2. Use that structure to optimize.
SLIDE 58

Problem Statement

We need a system that can:

1. Capture the structure of PyTorch programs. → TorchScript
2. Use that structure to optimize. → JIT Compiler
SLIDE 59

TorchScript

A static, high-performance subset of Python.

1. Prototype your model with PyTorch
2. Control flow is preserved
3. First-class support for lists, dicts, etc. (sketched below)
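A minimal sketch of points 2 and 3, compiled with torch.jit.script (the summarize function itself is invented for illustration):

    from typing import Dict
    import torch

    @torch.jit.script
    def summarize(x: torch.Tensor, clip: float) -> Dict[str, torch.Tensor]:
        # data-dependent control flow is preserved, not traced away
        if bool(x.abs().max() > clip):
            x = x.clamp(-clip, clip)
        # dicts (and lists, tuples, ...) are first-class values
        return {"mean": x.mean(), "std": x.std()}

    print(summarize(torch.randn(10), 1.0))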
SLIDE 60

PyTorch JIT

An optimizing just-in-time compiler for PyTorch programs.

1. Lightweight, thread-safe interpreter
2. Easy to write custom transformations
3. Not just for inference! Autodiff support (sketched below)
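A minimal sketch of point 3, with gradients flowing through a scripted function (f is an illustrative name):

    import torch

    @torch.jit.script
    def f(x: torch.Tensor) -> torch.Tensor:
        return (x * x).sum()

    x = torch.randn(3, requires_grad=True)
    f(x).backward()    # autograd runs through the compiled function
    print(x.grad)      # equals 2 * x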
SLIDE 61

CASE STUDY: Recursive Neural Network Grammars

  • Complex dynamic behavior based on the inputs
  • Typically written in pure C++

SLIDE 64

Complex Control Flow

SLIDE 65

Use common data structures

SLIDE 66

Define your own classes

SLIDE 67

Define your own classes (a sketch follows).
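A minimal sketch of a user-defined TorchScript class (Counter and count_tokens are invented names; torch.jit.annotate gives the empty dict its type):

    from typing import Dict, List
    import torch

    @torch.jit.script
    class Counter(object):
        def __init__(self):
            self.counts = torch.jit.annotate(Dict[str, int], {})

        def add(self, word: str):
            if word in self.counts:
                self.counts[word] = self.counts[word] + 1
            else:
                self.counts[word] = 1

    @torch.jit.script
    def count_tokens(tokens: List[str]) -> Dict[str, int]:
        # user-defined classes compose with scripted functions
        c = Counter()
        for t in tokens:
            c.add(t)
        return c.counts

    print(count_tokens(["the", "cat", "the"]))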

SLIDE 68

JIT AS A PLATFORM: WHAT'S NEXT?

QUANTIZATION: Model quantization done safely and automatically using JIT transformations.

MOBILE: A lightweight interpreter that can run on-device.

BACKENDS: Support for lowering models to static graph compilers, like TVM, Glow, XLA.

SLIDE 69

QUANTIZATION

SLIDE 70

OUR MISSION: SYSTEM-MODEL CO-DESIGN

  • Neural network inference is expensive
  • IoT and mobile devices have limited resources
  • Design models for efficient inference at scale

Give tools for building and running efficient models.
SLIDE 71

QUANTIZATION

  • Can neural networks run in lower precision? float16, int8
  • Supported by modern hardware: x86 CPU, ARM CPU, NVIDIA Volta & Turing, Qualcomm DSP, …
  • Maintaining accuracy is hard: working approaches, ongoing research

N × float32 → N × uint8: 4x less memory, 2-4x compute speedup

Quantization parameters: scale (float32) and zero_point (int32):

    float_val = (uint8_val - zero_point) × scale
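A minimal sketch of that affine mapping using the quantized-tensor primitives (the scale and zero_point values are illustrative):

    import torch

    x = torch.randn(8)

    # uint8_val = round(x / scale) + zero_point
    qx = torch.quantize_per_tensor(x, scale=0.05, zero_point=128,
                                   dtype=torch.quint8)
    print(qx.int_repr())            # the stored uint8 values

    # float_val = (uint8_val - zero_point) * scale
    x_hat = qx.dequantize()
    print((x - x_hat).abs().max())  # at most ~scale/2 for in-range values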

SLIDE 72

PYTORCH QUANTIZATION

TURN-KEY WORKFLOWS
  • Dynamic quantization
  • Post training quantization
  • Quantization aware training

COMPONENTS FOR TUNING AND RESEARCH
  • Every part of the workflow is flexible
  • Use or build your own (in PyTorch)

CORE SUPPORT
  • Quantized tensor and operations
  • Optimized kernels for x86 and ARM CPUs (other backends coming)
SLIDE 73

WORKFLOWS

  • Dynamic Quantization: quantizes weights only; works best for LSTMs and MLPs with small batch; accuracy: good
  • Post Training Quantization: quantizes weights and activations; requires calibration data; works best for CNNs; accuracy: good
  • Quantization-aware Training: quantizes weights and activations; requires fine-tuning; works best for all models; accuracy: best

Or build your own!

SLIDE 74

WORKFLOW: DYNAMIC QUANTIZATION

  • How: one-line API
  • What: quantize weights once, activations at runtime
  • Good for LSTMs and MLPs with small batch size
  • Savings: 2x faster compute, 4x less memory

[Diagram: nnqd.Linear with int8 W, float bias, float X in, float Y out.]

    import torch
    from torch.quantization import quantize_dynamic

    # load or train your model
    model = WordLanguageModel()
    model.load_state_dict(torch.load("model.pt"))

    # quantize
    qmodel = quantize_dynamic(model, dtype=torch.quint8)

    # use or deploy for C++ inference
    output = qmodel(input)
    torch.jit.script(qmodel).save("scripted.pt")
SLIDE 75

WORKFLOW: POST TRAINING

  • How: tweak model, calibrate on data, convert
  • What: quantize weights and activations for the entire model or submodules
  • Good for CNNs (if the accuracy drop is acceptable)
  • Savings: 1.5-2x faster compute, 4x less memory

[Diagram: a float Conv2d (float W, bias, X, Y) with observers attached; calibration collects qparams, and quantization converts it into nnq.Conv with int8 W, float bias, and uint8 X and Y.]

SLIDE 76

WORKFLOW: POST TRAINING

  • How: tweak model, calibrate on data, convert
  • What: quantize weights and activations for the entire model or submodules
  • Good for CNNs (if the accuracy drop is acceptable)
  • Savings: 1.5-2x faster compute, 4x less memory

    # load or train your model
    model = ResNet50()
    model.load_state_dict(torch.load("model.pt"))

    # tweak model for best results
    # change code directly or use manipulation APIs
    model = quantization.fuse_modules(model, [["conv1", "bn1", "relu1"]])

SLIDE 77

WORKFLOW: POST TRAINING

  • How: tweak model, calibrate on data, convert
  • What: quantize weights and activations for the entire model or submodules
  • Good for CNNs (if the accuracy drop is acceptable)
  • Savings: 1.5-2x faster compute, 4x less memory

    # load or train your model
    model = ResNet50()
    model.load_state_dict(torch.load("model.pt"))

    # tweak model for best results
    # change code directly or use manipulation APIs
    model = quantization.fuse_modules(model, [["conv1", "bn1", "relu1"]])

    # specify which part to quantize and how
    qmodel = quantization.prepare(model, {"": quantization.default_qconfig})  # configurable!

    # collect calibration statistics
    qmodel.eval()
    for batch, target in data_loader:
        model(batch)

    print(model.conv1)
    # ConvReLU2d(3, 64, kernel_size=(7, 7), ...
    #   (observer): MinMaxObserver(min_val=0.0, max_val=4.55)
    # )
SLIDE 78

WORKFLOW: POST TRAINING

  • How: tweak model, calibrate on data, convert
  • What: quantize weights and activations for the entire model or submodules
  • Good for CNNs (if the accuracy drop is acceptable)
  • Savings: 1.5-2x faster compute, 4x less memory

    # tweak model for best results
    # change code directly or use manipulation APIs
    model = quantization.fuse_modules(model, [["conv1", "bn1", "relu1"]])

    # specify which part to quantize and how
    qmodel = quantization.prepare(model, {"": quantization.default_qconfig})  # configurable!

    # collect calibration statistics
    qmodel.eval()
    for batch, target in data_loader:
        model(batch)

    # get the quantized model
    qmodel = quantization.convert(qmodel)

    print(model.conv1)
    # QuantizedConvReLU2d(3, 64, scale=0.035, zero_point=0,
    #                     kernel_size=(7, 7), ...)
SLIDE 79

WORKFLOW: POST TRAINING

  • How: tweak model, calibrate on data, convert
  • What: quantize weights and activations for the entire model or submodules
  • Good for CNNs (if the accuracy drop is acceptable)
  • Savings: 1.5-2x faster compute, 4x less memory

    # tweak model for best results
    # change code directly or use manipulation APIs
    model = quantization.fuse_modules(model, [["conv1", "bn1", "relu1"]])

    # specify which part to quantize and how
    qmodel = quantization.prepare(model, {"": quantization.default_qconfig})  # configurable!

    # collect calibration statistics
    qmodel.eval()
    for batch, target in data_loader:
        model(batch)

    # get the quantized model
    qmodel = quantization.convert(qmodel)

    # use or deploy for C++ inference
    qmodel(input)
    torch.jit.script(qmodel).save("quantized.pt")

SLIDE 80

SOON: JIT TO SIMPLIFY PREPARATION

Structural tweaks for TorchScript models happen automatically: fusion, batch norm folding, etc.
Status: coming in 1.4; check the nightlies.

    model = torch.jit.script(model)

    # tweak model for best results
    # change code or use manipulation APIs
    model = quantization.fuse_modules(model, [["conv1", "bn1", "relu1"]])

    qmodel = quantization.prepare_script(model, {"": quantization.default_qconfig})
    ...
    qmodel = quantization.convert_script(qmodel)
    qmodel.save("quantized.pt")
SLIDE 81

PYTORCH AT CORE

  • Same framework, no conversion
    • Same serialization
    • Python or TorchScript
  • Eager at its core
    • Most logic is in Python
    • Extensibility, debuggers, stack traces
  • Extensible API
    • New layers
    • Observers
    • Quantization techniques
    • Partial quantization

APIs: torch.quantization.*, torch.quantization.Observer, torch.quantization.FakeQuant, torch.nn.quantized.*, torch.nn.quantized.dynamic.*, torch.quantize_per_tensor, torch.quantize_per_channel

SLIDE 82

FRAMEWORK SUPPORT

  • Basic support: enough for CNNs and RNNs
  • Backends:
    • x86 CPU in 1.3 (via FBGEMM)
    • ARM CPU early alpha (QNNPACK)
  • In 1.4:
    • Broader ops coverage
    • CUDA support
    • API simplification for JIT models

Supported ops include: view, max_pool2d, avg_pool2d, clone, resize, slice, Conv2d, Linear, RNN, LSTM, topk, sort, upsample_nearest2d, interpolate, relu, max, +, *

SLIDE 83

Experimental in 1.3

TRY IT NOW: Quantization core and workflows
  • Post training, dynamic, and quantization-aware training
  • x86 and ARM CPU backends

COMING IN 1.4: More backends and JIT workflow
  • Simpler workflow for TorchScript
  • Expanding operator coverage

AVAILABLE IN TORCH.HUB: Quantized models and tutorials to obtain them
  • ResNet50, ResNext-101, InceptionV3, MobileNetV2, BERT, … more to come

SLIDE 84

PYTORCH 1.3
pytorch.org
SLIDE 85

THANK YOU!