TensorFlow (Marco Serafini, COMPSCI 532 Lecture 20)

SLIDE 1

TensorFlow

Marco Serafini

COMPSCI 532 Lecture 20

SLIDE 2

SLIDE 3

Motivations

DistBelief: Previous iteration
Parameter server
Limitations:
Monolithic layers, difficult to define new ones
Difficult to offload computation with complex dependencies to parameter servers

E.g. Apply updates based on gradients accumulated over multiple iterations
Fixed execution pattern
Read data, compute loss function (forward pass), compute gradients for parameters (backward pass), write gradients to parameter server

Not optimized for single workstations and GPUs

SLIDE 4

TensorFlow

Dataflow graph of operators, but not a DAG
Loops and conditionals
Deferred (lazy) execution
Enables optimizations, e.g. pipelining with GPU kernels
Composable basic operators
Matrix multiplication, convolution, ReLU
Concept of devices
CPUs, GPUs, mobile devices
Different implementations of the operators

SLIDE 5

Difference with Parameter Server

Parameter server
Separate worker nodes and parameter nodes
Different interfaces
TensorFlow: only tasks
Shared parameters are held by stateful operators: variables and queues
Tasks managing them are called PS tasks
PS tasks are regular tasks: they can run arbitrary operators
Uniform programming interface

SLIDE 6

Example

[Figure: example dataflow graph; variables such as b_1 are annotated as stateful operators]

SLIDE 7

Example

Data-parallel training looks like this

[Figure: data-parallel training graph; annotations: stateful queues, stateful variables, concurrent steps for data parallelism]

SLIDE 8

Dataflow Graph

Vertex: unit of local computation
Called operation in TensorFlow
Edges: inputs and outputs of computation
Values along edges are called tensors
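
The vertex/edge vocabulary above can be sketched in plain Python. This is a toy model, not TensorFlow's API: the Op class and the evaluate function are hypothetical names, and tensors are modelled as Python lists and floats. Building the graph computes nothing; work happens only when evaluate is called.

```python
# Toy dataflow graph: vertices are operations, edges carry values ("tensors").
# Op and evaluate are illustrative names, not TensorFlow APIs.

class Op:
    def __init__(self, name, fn, inputs=()):
        self.name = name              # vertex label
        self.fn = fn                  # local computation performed by the vertex
        self.inputs = list(inputs)    # incoming edges (producer operations)

def evaluate(op, cache=None):
    """Run `op` after its inputs; nothing executes until this is called."""
    cache = {} if cache is None else cache
    if op.name not in cache:
        args = [evaluate(i, cache) for i in op.inputs]
        cache[op.name] = op.fn(*args)
    return cache[op.name]

# Building the graph performs no computation (deferred execution).
x = Op("x", lambda: [1.0, 2.0])
w = Op("w", lambda: [3.0, 4.0])
dot = Op("dot", lambda a, b: sum(p * q for p, q in zip(a, b)), [x, w])
relu = Op("relu", lambda v: max(v, 0.0), [dot])

result = evaluate(relu)   # 1*3 + 2*4 = 11.0, unchanged by ReLU
```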

SLIDE 9

Tensors

Edges in dataflow graph
Data flowing among operators
Format
n-dimensional arrays
Elements have primitive types (including byte arrays)
Tensors are dense
All elements are represented
User must find ways to encode sparse data efficiently
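
The slide does not say which encoding to use. One common option is a coordinate format that stores a sparse vector as two dense lists, one of indices and one of values. The sketch below assumes that format; to_coo and to_dense are hypothetical helper names.

```python
# Coordinate-style encoding of a sparse vector as two dense lists.
# to_coo / to_dense are hypothetical helper names.

def to_coo(dense):
    """Return (indices, values) holding only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return indices, values

def to_dense(indices, values, size):
    """Rebuild the dense vector from its coordinate encoding."""
    out = [0] * size
    for i, v in zip(indices, values):
        out[i] = v
    return out

dense = [0, 0, 5, 0, 0, 0, 7, 0]
indices, values = to_coo(dense)            # two small dense lists
roundtrip = to_dense(indices, values, len(dense))
```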

SLIDE 10

Operations

Vertices in dataflow graph
State is encapsulated in operations
Variables and queues
Access to state (and tensors)
Variable op: returns a unique reference handle
Read op: takes a reference handle, produces the value of the variable
Write ops: take a reference handle and a value, and update the variable
Queues are also stateful operators
Get reference handle, modify through operations
Blocking semantics, backpressure, synchronization
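
A minimal sketch of the handle-based access pattern above, in plain Python rather than TensorFlow. Variable, read and assign_add are illustrative names chosen for this example, and the standard-library queue.Queue stands in for a stateful queue operation.

```python
import queue

class Variable:
    """Owns a mutable buffer; the object itself acts as the reference handle."""
    def __init__(self, initial):
        self._value = initial

def read(handle):
    """Read op: takes a reference handle, produces the variable's value."""
    return handle._value

def assign_add(handle, delta):
    """Write op: takes a reference handle and a value, updates in place."""
    handle._value += delta

w = Variable(1.0)
assign_add(w, 0.5)
current = read(w)

# A bounded queue gives blocking semantics: put() blocks when the queue is
# full (backpressure) and get() blocks when it is empty (synchronization).
q = queue.Queue(maxsize=2)
q.put("batch-0")
q.put("batch-1")
is_full = q.full()
first = q.get()
```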

SLIDE 11

Execution Model

Step: client executes a subgraph by indicating:
Edges to feed the subgraph with input tensors
Edges to fetch the output tensors
Runtime prunes the subgraph to remove operations not needed to produce the fetched outputs
Subgraphs are run asynchronously by default
Can execute multiple partial, concurrent subgraphs
Example: concurrent batches for data-parallel training
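
Pruning can be sketched as a backward traversal from the fetched operations that stops at fed edges. This is a toy illustration: prune and the graph dictionary (operation name mapped to the names of its inputs) are assumptions made for the example, not TensorFlow internals.

```python
def prune(graph, feeds, fetches):
    """Return the set of operations needed to compute `fetches` given `feeds`."""
    needed, stack = set(), list(fetches)
    while stack:
        op = stack.pop()
        if op in needed:
            continue
        needed.add(op)
        if op not in feeds:        # a fed value cuts off everything upstream
            stack.extend(graph[op])
    return needed

graph = {
    "reader": [],
    "input": ["reader"],
    "w": [],
    "matmul": ["input", "w"],
    "loss": ["matmul"],
    "summary": ["loss"],
}

# Feeding "input" makes "reader" unnecessary; "summary" is not fetched.
step_ops = prune(graph, feeds={"input"}, fetches=["loss"])
```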

SLIDE 12

Distributed Execution

Tasks: named processes that send messages
PS tasks: store variables, but can also run computations
Worker tasks: the rest
Note: “informal” categories, not enforced by TensorFlow
Devices: CPU, GPU, TPU, mobile, …
CPU is the host device
Device executes kernel for each operation assigned to it
Same operation (e.g. matrix multiplication) has different kernels for different devices
Requirements for a device
Must accept kernel for execution
Must allocate memory for inputs and outputs
Must transfer data to and from host memory

SLIDE 13

Distributed Execution

Each operation
Resides on a device
Corresponds to one or more kernels
Multiple kernels can be specialized for different devices
Operations are executed within a task
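
The operation-to-kernel relationship can be pictured as a registry keyed by (operation, device). The sketch below is a plain-Python assumption about how such a lookup could look; KERNELS, register_kernel and run are hypothetical names, and the "GPU" kernel is only a stand-in that reuses the CPU code.

```python
KERNELS = {}   # (operation name, device type) -> kernel function

def register_kernel(op, device):
    def decorator(fn):
        KERNELS[(op, device)] = fn
        return fn
    return decorator

@register_kernel("MatMul", "CPU")
def matmul_cpu(a, b):
    # Plain nested loops over rows of a and columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

@register_kernel("MatMul", "GPU")
def matmul_gpu(a, b):
    # Stand-in only: a real GPU kernel would launch device code.
    return matmul_cpu(a, b)

def run(op, device, *inputs):
    """Pick the kernel registered for this operation on this device."""
    return KERNELS[(op, device)](*inputs)

product = run("MatMul", "CPU", [[1, 2]], [[3], [4]])
```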

SLIDE 14

Distributed Scheduling

TensorFlow runtime places operations on devices
Implicit constraints: stateful operation on same device as state
Explicit constraints: dictated by the user
Optimal placement still open question
Obtain per-device subgraphs
All operations assigned to device
Send and Receive operations to replace edges
Specialized per-device implementations
CPU – GPU: CUDA memory copy
Across tasks: TCP or RDMA
Placement preserved throughout session
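
Once every operation has a device, the graph can be split per device and each cross-device edge replaced by a Send/Recv pair. The sketch below illustrates that rewrite on a toy edge list; partition, the device names and the Send/Recv label strings are assumptions for the example.

```python
def partition(edges, placement):
    """Split edges into per-device subgraphs; cross-device edges become Send/Recv."""
    subgraphs = {d: [] for d in set(placement.values())}
    for src, dst in edges:
        s_dev, d_dev = placement[src], placement[dst]
        if s_dev == d_dev:
            subgraphs[s_dev].append((src, dst))
        else:
            # The producer side gets a Send, the consumer side a matching Recv.
            subgraphs[s_dev].append((src, f"Send({src}->{d_dev})"))
            subgraphs[d_dev].append((f"Recv({src}<-{s_dev})", dst))
    return subgraphs

edges = [("w", "matmul"), ("x", "matmul"), ("matmul", "relu")]
placement = {
    "w": "ps/cpu",            # variable lives on the PS task
    "x": "worker/gpu",
    "matmul": "worker/gpu",
    "relu": "worker/gpu",
}
subgraphs = partition(edges, placement)
```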

SLIDE 15

Dynamic Control Flow

How to enable dynamic control flow with a static graph?
Example: recurrent neural network
Train network for sequence of variable length without unrolling
Conditional: Switch and Merge

[Figure: Switch takes a data input and a control input p; Merge takes inputs, some of which may be dead, and outputs one non-dead input]
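
The Switch/Merge pair can be read as follows: Switch forwards its data input to one of two outputs and marks the other as dead, and Merge forwards whichever input is not dead. The plain-Python sketch below models this with a DEAD sentinel. All names here (DEAD, switch, merge, apply_op, cond) are illustrative, and the dead-value propagation through ordinary operations is an assumption of this toy model.

```python
DEAD = object()   # sentinel marking a dead (not taken) value

def switch(data, pred):
    """Return (false_output, true_output); the untaken branch is dead."""
    return (DEAD, data) if pred else (data, DEAD)

def apply_op(fn, value):
    """Ordinary operations propagate deadness instead of computing."""
    return DEAD if value is DEAD else fn(value)

def merge(*inputs):
    """Forward one non-dead input."""
    for v in inputs:
        if v is not DEAD:
            return v
    return DEAD

def cond(pred, data, true_fn, false_fn):
    f_out, t_out = switch(data, pred)
    return merge(apply_op(true_fn, t_out), apply_op(false_fn, f_out))

doubled = cond(True, 10, lambda v: v * 2, lambda v: v - 1)
decremented = cond(False, 10, lambda v: v * 2, lambda v: v - 1)
```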

SLIDE 16

Loops

Uses three additional operators

Enter, Exit, NextIteration

[Figure: loop built from Enter, Exit and NextIteration; labels: data input, p]
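
One way to read the three loop operators, together with Switch and Merge, is as the pieces of an ordinary while loop. The sketch below is a sequential plain-Python reading of their roles, not how the runtime executes them; while_loop is a hypothetical name and the comments mark which operator each step corresponds to under that reading.

```python
def while_loop(pred, body, init):
    value = init                    # Enter: brings the value into the loop
    while True:
        merged = value              # Merge: initial value or NextIteration value
        if pred(merged):            # Switch on the loop predicate p
            value = body(merged)    # loop body; NextIteration feeds Merge again
        else:
            return merged           # Exit: passes the value out of the loop

# Sum the integers 0..4 with loop state (i, total).
final = while_loop(
    pred=lambda s: s[0] < 5,
    body=lambda s: (s[0] + 1, s[1] + s[0]),
    init=(0, 0),
)
```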

SLIDE 17

Scaling to Large Models

Model parallelism
Avoids moving terabytes of parameters every time
Operations (typically implemented by library)
Gather: reads tensor data from shard and computes
Part: Partitions the input across shards of parameters
Stitch: Aggregates all partitions
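
The three operations can be sketched as a sharded table lookup in plain Python. This is a toy under stated assumptions: row i is assumed to live on shard i % num_shards, and part, gather and stitch are illustrative functions named after the slide's operations, not library calls.

```python
def part(indices, num_shards):
    """Partition input indices by owning shard (row i lives on shard i % num_shards)."""
    parts = [[] for _ in range(num_shards)]
    for pos, idx in enumerate(indices):
        parts[idx % num_shards].append((pos, idx))
    return parts

def gather(shard, shard_part, num_shards):
    """Read the requested rows from one shard, next to where they are stored."""
    return [(pos, shard[idx // num_shards]) for pos, idx in shard_part]

def stitch(gathered, n):
    """Reassemble per-shard results in the original input order."""
    out = [None] * n
    for rows in gathered:
        for pos, row in rows:
            out[pos] = row
    return out

table = [[float(i), float(i) * 10] for i in range(6)]    # 6 rows x 2 columns
num_shards = 2
shards = [table[s::num_shards] for s in range(num_shards)]

indices = [4, 1, 5, 0]
parts = part(indices, num_shards)
gathered = [gather(shards[s], parts[s], num_shards) for s in range(num_shards)]
rows = stitch(gathered, len(indices))
```

Only the requested rows cross shard boundaries; the full parameter table never moves.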


SLIDE 18

Fault Tolerance

Long running tasks face failures and pre-emption
Sometimes run at night on idle machines
Small operations, no need to tolerate individual failures
Even RDDs are overkill
Users use the Save operation for checkpointing
Each variable in a task is connected to the same Save operation, so writes are batched
Asynchronous, not consistent
Restore operation executed by clients at startup
Other use cases: transfer learning
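
A minimal sketch of user-level checkpointing, in plain Python. The save and restore functions, the JSON file format and the file name are assumptions made for this example; they mimic the roles of the Save and Restore operations but are not TensorFlow's implementation.

```python
import json
import os
import tempfile

def save(variables, path):
    """Write all variables of a task in one batch (a single file write)."""
    with open(path, "w") as f:
        json.dump(variables, f)

def restore(path):
    """Read the checkpoint back, e.g. at startup after a failure."""
    with open(path) as f:
        return json.load(f)

ckpt_dir = tempfile.mkdtemp()
ckpt_path = os.path.join(ckpt_dir, "model.ckpt.json")   # hypothetical file name

variables = {"w": [0.1, 0.2], "b": 0.5, "step": 100}
save(variables, ckpt_path)

restored = restore(ckpt_path)   # training resumes from step 100
```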

SLIDE 19

Coordination

TensorFlow is asynchronous by default
Stochastic Gradient Descent tolerates asynchrony
Asynchrony increases throughput
But synchrony has benefits
Using stale parameters slows down convergence
System must support user-defined synchrony

SLIDE 20

Synchronous Coordination

Use blocking queues for synchrony
Redundant tasks for stragglers

[Figure: synchronous coordination; annotations: blocking queues on inputs and outputs; different colors = different versions of the parameters; proactive (not reactive) backup workers]
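
The effect of proactive backup workers can be sketched by aggregating only the first m gradients to arrive out of n launched workers. The example below replaces real blocking queues with a list of made-up arrival times to stay deterministic; sync_step and the numbers are assumptions for illustration.

```python
def sync_step(gradients_with_arrival, m):
    """Average the first m gradients to arrive; later (straggler) ones are dropped."""
    by_arrival = sorted(gradients_with_arrival, key=lambda g: g[0])
    used = by_arrival[:m]
    step_time = used[-1][0]            # step finishes when the m-th gradient arrives
    avg = sum(g for _, g in used) / m
    return avg, step_time

# (arrival_time_seconds, gradient) for 4 workers; the last one is a straggler.
arrivals = [(1.0, 0.5), (1.25, 0.25), (1.5, 0.75), (9.0, 0.5)]

avg_no_backup, t_no_backup = sync_step(arrivals, m=4)   # waits for the straggler
avg_backup, t_backup = sync_step(arrivals, m=3)         # 1 of 4 acts as a backup
```

With one backup worker the step completes at the third arrival instead of waiting for the straggler, at the cost of discarding one worker's computation.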

SLIDE 21

Implementation

Distributed master
Obtain subgraphs for each participating device
Dataflow executor
Handles requests from master
Schedules the execution of the kernels of the local subgraph
Data transfer to device and over network

SLIDE 22

Single-Machine Performance

Similar to COST analysis
Comparison with single-server (not single-threaded) tools
Four convolutional models using one GPU

SLIDE 23

Synchronous Microbenchmarks

Null training steps
Sparse performance is close to optimal (scalar)

SLIDE 24

Scalability

Scalability bound by access to PS tasks (7 in the experiment)
Synchronous coordination scales well
Backups are beneficial (but an expensive way to do fault tolerance)