TensorFlow, Marco Serafini, COMPSCI 532 Lecture 20



SLIDE 1

TensorFlow

Marco Serafini

COMPSCI 532 Lecture 20

SLIDE 2
SLIDE 3

Motivations

  • DistBelief: previous iteration
      • Parameter server
  • Limitations:
      • Monolithic layers, difficult to define new ones
      • Difficult to offload computation with complex dependencies to parameter servers
          • E.g., apply updates based on gradients accumulated over multiple iterations
      • Fixed execution pattern
          • Read data, compute loss function (forward pass), compute gradients for parameters (backward pass), write gradients to parameter server
      • Not optimized for single workstations and GPUs
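
The fixed execution pattern described above can be sketched in a few lines of plain Python. This is a toy illustration, not DistBelief's actual interface: `ParameterServer`, `worker_step`, the scalar model, and the learning rate are all assumptions made for the example.

```python
# Toy parameter-server loop (hypothetical names; not DistBelief's real API).

class ParameterServer:
    """Holds one scalar parameter and applies gradients pushed by workers."""

    def __init__(self, w=0.0, lr=0.1):
        self.w = w
        self.lr = lr

    def read(self):
        return self.w

    def write_gradient(self, grad):
        # Plain SGD update, applied immediately on every push.
        self.w -= self.lr * grad


def worker_step(ps, x, y):
    """One fixed-pattern step: read, forward, backward, write."""
    w = ps.read()                # read parameters
    loss = (w * x - y) ** 2      # forward pass: squared error
    grad = 2 * (w * x - y) * x   # backward pass: d(loss)/dw
    ps.write_gradient(grad)      # write gradient to the parameter server
    return loss


ps = ParameterServer()
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2x
losses = [worker_step(ps, x, y) for _ in range(50) for x, y in data]
```

In this sketch the server can only apply each gradient as it arrives. A policy such as "accumulate gradients over several iterations, then update" would require changing the server itself, which is the kind of inflexibility the slide points at.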
SLIDE 4

TensorFlow

  • Dataflow graph of operators, but not necessarily a DAG
      • Supports loops and conditionals
  • Deferred (lazy) execution
      • Enables optimizations, e.g. pipelining with GPU kernels
  • Composable basic operators
      • Matrix multiplication, convolution, ReLU
  • Concept of devices
      • CPUs, GPUs, mobile devices
      • Different implementations of the operators
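
As a rough illustration of deferred execution with composable operators, the sketch below first builds a graph and only later evaluates it. The `Node` class, the `KERNELS` table, and the `run` function are hypothetical names for this example and are not TensorFlow's API.

```python
class Node:
    """A vertex in a toy dataflow graph: an operator plus its input nodes."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = tuple(inputs)
        self.value = value


def constant(v):
    return Node("const", value=v)


def add(a, b):
    return Node("add", (a, b))


def mul(a, b):
    return Node("mul", (a, b))


def relu(a):
    return Node("relu", (a,))


# One implementation per operator; a real system would keep one per device.
KERNELS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "relu": lambda a: max(a, 0.0),
}


def run(node):
    """Evaluate a node only when asked, recursing through its inputs."""
    if node.op == "const":
        return node.value
    args = [run(i) for i in node.inputs]
    return KERNELS[node.op](*args)


# Building the graph performs no arithmetic.
y = relu(add(mul(constant(3.0), constant(-2.0)), constant(1.0)))
# The computation happens only when the graph is run.
result = run(y)
```

Because the whole graph exists before anything runs, a runtime has the chance to inspect and optimize it first, which is the point of the "deferred execution" bullet.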
SLIDE 5

Difference with Parameter Server

  • Parameter server
      • Separate worker nodes and parameter nodes
      • Different interfaces
  • TensorFlow: only tasks
      • Shared parameters are held by stateful operators: variables and queues
      • Tasks managing them are called PS tasks
      • PS tasks are regular tasks: they can run arbitrary operators
      • Uniform programming interface
SLIDE 6

Example

[Figure: example dataflow graph; surviving labels: b_1, stateful operators]

SLIDE 7

Example

  • Data-parallel training looks like this

[Figure labels: stateful queues, stateful variables, concurrent steps for data parallelism]

SLIDE 8

Dataflow Graph

  • Vertex: unit of local computation
      • Called an operation in TensorFlow
  • Edge: input or output of a computation
      • Values along edges are called tensors
SLIDE 9

Tensors

  • Edges in dataflow graph
      • Data flowing among operators
  • Format
      • n-dimensional arrays
      • Elements have primitive types (including byte arrays)
  • Tensors are dense
      • All elements are represented
      • User must find ways to encode sparse data efficiently
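
One common way to encode sparse data with dense tensors only is to store a dense list of coordinates next to a dense list of values. The sketch below shows that idea on nested Python lists; the function names `to_coo` and `to_dense` are made up for this example.

```python
def to_coo(dense):
    """Encode a 2-D list as two dense arrays: indices and values."""
    indices, values = [], []
    for r, row in enumerate(dense):
        for c, v in enumerate(row):
            if v != 0:
                indices.append([r, c])
                values.append(v)
    shape = [len(dense), len(dense[0]) if dense else 0]
    return indices, values, shape


def to_dense(indices, values, shape):
    """Rebuild the full matrix, filling unlisted positions with zero."""
    out = [[0] * shape[1] for _ in range(shape[0])]
    for (r, c), v in zip(indices, values):
        out[r][c] = v
    return out


matrix = [[0, 0, 5], [0, 0, 0], [7, 0, 0]]
indices, values, shape = to_coo(matrix)
```

Both `indices` and `values` are themselves dense arrays, so they can travel along graph edges like any other tensor while representing only the two non-zero entries.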
SLIDE 10

Operations

  • Vertices in dataflow graph
  • State is encapsulated in operations
      • Variables and queues
  • Access to state (and tensors)
      • Variable op: returns a unique reference handle
      • Read op: takes a reference handle, produces the value of the variable
      • Write ops: take a reference handle and a value, and update the variable
  • Queues are also stateful operators
      • Get a reference handle, modify through operations
      • Blocking semantics, backpressure, synchronization
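
The handle-based access pattern and the backpressure behaviour of queues can be imitated with the Python standard library. This is only a sketch: `VariableStore` and its integer handles are assumptions for illustration, and `queue.Queue` stands in for a TensorFlow queue.

```python
import queue


class VariableStore:
    """Toy state holder: handles are opaque integers (an assumption)."""

    def __init__(self):
        self._state = {}

    def variable(self, initial):
        handle = len(self._state)      # unique reference handle
        self._state[handle] = initial
        return handle

    def read(self, handle):
        return self._state[handle]     # read op: handle -> value

    def assign_add(self, handle, delta):
        self._state[handle] += delta   # write op: handle + value -> update


store = VariableStore()
w = store.variable(1.0)
store.assign_add(w, 0.5)
w_value = store.read(w)

# A bounded queue gives backpressure: a full queue makes producers wait.
q = queue.Queue(maxsize=2)
q.put("batch-0")
q.put("batch-1")
try:
    q.put("batch-2", block=False)      # with block=True this would wait
    producer_blocked = False
except queue.Full:
    producer_blocked = True

first = q.get()                        # consumer frees a slot
q.put("batch-2", block=False)          # now the producer can proceed
```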
SLIDE 11

Execution Model

  • Step: client executes a subgraph by indicating:
      • Edges to feed the subgraph with input tensors
      • Edges to fetch the output tensors
  • Runtime prunes the subgraph to remove unnecessary operations
  • Subgraphs are run asynchronously by default
      • Can execute multiple partial, concurrent subgraphs
      • Example: concurrent batches for data-parallel training
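
Feeding, fetching, and pruning can be sketched as a backward traversal from the fetched edges that stops at fed edges. The graph encoding (a dict from name to function and input names) and the names `prune` and `run_step` are assumptions for this example, not TensorFlow's implementation.

```python
# Graph as a dict: name -> (function, list of input names). Hypothetical ops.
graph = {
    "x":    (None, []),                          # placeholder, must be fed
    "w":    (lambda: 3.0, []),
    "y":    (lambda x, w: x * w, ["x", "w"]),
    "loss": (lambda y: (y - 6.0) ** 2, ["y"]),
    "log":  (lambda y: f"y={y}", ["y"]),         # not needed to fetch "loss"
}


def prune(graph, feeds, fetches):
    """Return the set of ops that must run to produce the fetches."""
    needed, stack = set(), list(fetches)
    while stack:
        name = stack.pop()
        if name in needed or name in feeds:
            continue  # a fed edge cuts the graph: nothing upstream is needed
        needed.add(name)
        stack.extend(graph[name][1])
    return needed


def run_step(graph, feeds, fetches):
    """Run one step: evaluate only what the fetches depend on."""
    needed = prune(graph, feeds, fetches)
    values = dict(feeds)

    def evaluate(name):
        if name not in values:
            fn, inputs = graph[name]
            values[name] = fn(*[evaluate(i) for i in inputs])
        return values[name]

    return [evaluate(f) for f in fetches], needed


outputs, executed = run_step(graph, feeds={"x": 2.0}, fetches=["loss"])
```

Feeding an intermediate edge shrinks the executed subgraph further: calling `run_step(graph, {"y": 5.0}, ["loss"])` runs only the `loss` operation.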
SLIDE 12

Distributed Execution

  • Tasks: named processes that send messages
      • PS tasks: store variables, but can also run computations
      • Worker tasks: the rest
      • Note: “informal” categories, not enforced by TensorFlow
  • Devices: CPU, GPU, TPU, mobile, …
      • CPU is the host device
      • Device executes a kernel for each operation assigned to it
      • Same operation (e.g. matrix multiplication) has different kernels for different devices
  • Requirements for a device
      • Must accept kernel for execution
      • Must allocate memory for inputs and outputs
      • Must transfer data to and from host memory
SLIDE 13

Distributed Execution

  • Each operation
      • Resides on a device
      • Corresponds to one or more kernels
      • Multiple kernels can be specialized for different devices
  • Operations are executed within a task
SLIDE 14

Distributed Scheduling

  • TensorFlow runtime places operations on devices
      • Implicit constraints: stateful operation on same device as its state
      • Explicit constraints: dictated by the user
      • Optimal placement is still an open question
  • Obtain per-device subgraphs
      • All operations assigned to a device
      • Send and Receive operations replace edges that cross devices
      • Specialized per-device implementations
          • CPU – GPU: CUDA memory copy
          • Across tasks: TCP or RDMA
  • Placement preserved throughout session
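
Given a placement, splitting the graph into per-device subgraphs amounts to cutting every edge whose endpoints sit on different devices and inserting a Send on one side and a Receive on the other. The sketch below assumes a simple edge-list graph; the device names and the `partition` function are invented for illustration.

```python
def partition(edges, placement):
    """Split edges into per-device subgraphs, adding Send/Recv at cuts.

    edges: list of (src_op, dst_op); placement: op name -> device name.
    """
    subgraphs = {dev: [] for dev in set(placement.values())}
    for src, dst in edges:
        src_dev, dst_dev = placement[src], placement[dst]
        if src_dev == dst_dev:
            subgraphs[src_dev].append((src, dst))
        else:
            send = f"Send({src}->{dst_dev})"
            recv = f"Recv({src}<-{src_dev})"
            subgraphs[src_dev].append((src, send))
            subgraphs[dst_dev].append((recv, dst))
    return subgraphs


edges = [("read_w", "matmul"), ("input", "matmul"), ("matmul", "relu")]
placement = {
    "read_w": "ps:cpu0",        # stateful op stays with its variable
    "input": "worker:cpu0",
    "matmul": "worker:gpu0",
    "relu": "worker:gpu0",
}
subgraphs = partition(edges, placement)
```

In this example the `read_w` to `matmul` edge crosses tasks (it would use the network transport), while `input` to `matmul` stays inside one task and crosses only the CPU/GPU boundary.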
SLIDE 15

Dynamic Control Flow

  • How to enable dynamic control flow with a static graph?
  • Example: recurrent neural network
      • Train network for sequences of variable length without unrolling
  • Conditional: Switch and Merge

[Figure: Switch operator with a data input and a control input p; Merge operator, which may receive a dead input and outputs one non-dead input]
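
A conditional built from Switch and Merge can be imitated in plain Python by passing a "dead" marker along the branch that is not taken. The `DEAD` sentinel and the helper names below are assumptions for this sketch, not TensorFlow identifiers.

```python
DEAD = object()  # sentinel for a dead value on the untaken branch


def switch(data, pred):
    """Forward data to exactly one of two outputs: (false_out, true_out)."""
    return (DEAD, data) if pred else (data, DEAD)


def apply(fn, value):
    """Ordinary operators propagate deadness instead of executing."""
    return DEAD if value is DEAD else fn(value)


def merge(*inputs):
    """Output the one non-dead input."""
    live = [v for v in inputs if v is not DEAD]
    assert len(live) == 1, "exactly one input should be live"
    return live[0]


def cond(pred, x):
    """if pred: x * 2, else: x + 100, expressed with Switch and Merge."""
    false_out, true_out = switch(x, pred)
    return merge(apply(lambda v: v * 2, true_out),
                 apply(lambda v: v + 100, false_out))


taken_true = cond(True, 5)
taken_false = cond(False, 5)
```

Both branches exist in the static graph, but only the branch that receives a live value does any work.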

SLIDE 16

Loops

  • Uses three additional operators

[Figure: loop construct; surviving labels: Enter, Data input, p, Exit, NextIteration]
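
To give a feel for the role of the three loop operators, here is an ordinary Python loop annotated with where each operator would act. It is an analogy under the usual reading of Enter, Exit, and NextIteration, not a model of how the runtime schedules them; `while_loop` is a hypothetical helper.

```python
def while_loop(pred, body, init):
    """Run body while pred holds; return (final value, iteration count)."""
    value = init                      # Enter: bring the value into the loop
    iteration = 0
    while True:
        # Merge: takes Enter's value on the first pass, and the value
        # forwarded by NextIteration on later passes.
        if not pred(value):           # Switch on the loop predicate
            return value, iteration   # Exit: pass the value out of the loop
        value = body(value)
        iteration += 1                # NextIteration: forward to next pass


result, n_iterations = while_loop(lambda v: v < 100, lambda v: v * 2, 1)
```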

SLIDE 17

Scaling to Large Models

  • Model parallelism
      • Avoids moving terabytes of parameters every time
  • Operations (typically implemented by a library)
      • Gather: reads tensor data from a shard and computes on it
      • Part: partitions the input across shards of parameters
      • Stitch: aggregates all partitions

[Figure labels: parameters, inputs]
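
The Part, Gather, Stitch pipeline can be sketched for a sharded lookup table. Sharding by `row_id % NUM_SHARDS`, the table contents, and the function signatures are assumptions made for this example.

```python
NUM_SHARDS = 2

# Lookup table with 6 rows, sharded by row_id % NUM_SHARDS (an assumption).
table = {i: [float(i), float(i) * 10] for i in range(6)}
shards = [
    {i: row for i, row in table.items() if i % NUM_SHARDS == s}
    for s in range(NUM_SHARDS)
]


def part(ids):
    """Partition lookup ids by shard, remembering original positions."""
    buckets = [[] for _ in range(NUM_SHARDS)]
    for pos, i in enumerate(ids):
        buckets[i % NUM_SHARDS].append((pos, i))
    return buckets


def gather(shard, bucket):
    """Runs next to the shard: only the requested rows leave it."""
    return [(pos, shard[i]) for pos, i in bucket]


def stitch(partials, n):
    """Reassemble per-shard results in the original order."""
    out = [None] * n
    for partial in partials:
        for pos, row in partial:
            out[pos] = row
    return out


ids = [4, 1, 5, 0]
partials = [gather(shards[s], b) for s, b in enumerate(part(ids))]
rows = stitch(partials, len(ids))
```

Only the four requested rows cross shard boundaries; the rest of the table never moves, which is the benefit the first bullet describes.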

SLIDE 18

Fault Tolerance

  • Long-running tasks face failures and pre-emption
      • Sometimes run at night on idle machines
  • Small operations, no need to tolerate individual failures
      • Even RDDs are overkill
  • Users use the Save operation for checkpointing
      • Each variable in a task is connected to the same Save for batching
      • Asynchronous, not consistent
  • Restore operation executed by clients at startup
  • Other use cases: transfer learning
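
User-level checkpointing with periodic saves and a restore at startup can be sketched with the standard library. The file format (JSON), the checkpoint interval, and the `save`/`restore` helpers are assumptions; this single-process sketch also does not reproduce the asynchronous, non-consistent behaviour noted above.

```python
import json
import os
import tempfile


def save(variables, path):
    """Write all variables in one batch; write-then-rename avoids torn files."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(variables, f)
    os.replace(tmp, path)


def restore(path, defaults):
    """At startup, load the latest checkpoint if one exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return dict(defaults)


ckpt = os.path.join(tempfile.mkdtemp(), "model.ckpt.json")
variables = restore(ckpt, {"w": 0.0, "step": 0})
for _ in range(5):
    variables["w"] += 0.1
    variables["step"] += 1
    if variables["step"] % 2 == 0:
        save(variables, ckpt)                      # periodic checkpoint

recovered = restore(ckpt, {"w": 0.0, "step": 0})   # simulated restart
```

After the simulated restart, training resumes from step 4; the work of step 5 is lost, which is acceptable when individual steps are small.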
SLIDE 19

Coordination

  • TensorFlow is asynchronous by default
      • Stochastic Gradient Descent tolerates asynchrony
      • Asynchrony increases throughput
  • But synchrony has benefits
      • Using stale parameters slows down convergence
  • System must support user-defined synchrony
SLIDE 20

Synchronous Coordination

  • Use blocking queues for synchrony
  • Redundant tasks for stragglers

[Figure labels: blocking queues on inputs and outputs; different colors = different versions of parameter; proactive (not reactive) backup workers]
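
The effect of proactive backup workers can be illustrated with a small calculation: with n workers and b backups, a synchronous step finishes as soon as the fastest n - b gradients have arrived, and the remaining ones are discarded. The latencies and gradients below are made-up numbers, and both function names are hypothetical.

```python
def sync_step_time(latencies, backups):
    """Time until the fastest (n - backups) workers have finished."""
    needed = len(latencies) - backups
    return sorted(latencies)[needed - 1]


def aggregate(gradients, latencies, backups):
    """Average the gradients of the fastest workers only."""
    needed = len(latencies) - backups
    order = sorted(range(len(latencies)), key=lambda i: latencies[i])
    fastest = order[:needed]
    return sum(gradients[i] for i in fastest) / needed


latencies = [1.0, 1.1, 0.9, 5.0]   # worker 3 is a straggler
gradients = [0.2, 0.4, 0.3, 0.1]

no_backup = sync_step_time(latencies, backups=0)    # waits for the straggler
one_backup = sync_step_time(latencies, backups=1)   # ignores the slowest
update = aggregate(gradients, latencies, backups=1)
```

With one backup the step completes at time 1.1 instead of 5.0, at the cost of running one extra worker whose result is thrown away.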

SLIDE 21

Implementation

  • Distributed master
      • Obtains subgraphs for each participating device
  • Dataflow executor
      • Handles requests from master
      • Schedules the execution of the kernels of the local subgraph
      • Data transfer to device and over network

SLIDE 22

Single-Machine Performance

  • Similar to COST analysis
  • Comparison with single-server (not single-threaded) tools
  • Four convolutional models using one GPU
SLIDE 23

Synchronous Microbenchmarks

  • Null training steps
  • Sparse performance is close to optimal (scalar)
SLIDE 24

Scalability

  • Scalability is bound by access to PS tasks (7 in the experiment)
  • Synchronous coordination scales well
  • Backups are beneficial (but an expensive way to do fault tolerance)