SLIDE 1

Time-Cost Trade-offs of Pipelined Dataflow Applications

Jonathan Kho1, Erik Saule4, Anas AbuDoleh1, Xusheng Wang1, Hao Ding2, Kun Huang1, Raghu Machiraju1,2, Ümit V. Çatalyürek1,3

1 Biomedical Informatics, The Ohio State University
2 Computer Science and Engineering, The Ohio State University
3 Electrical and Computer Engineering, The Ohio State University
4 Computer Science, UNC Charlotte

Aussois 2016

Erik Saule (UNC Charlotte) Time-Cost in Pipelined Applications Aussois 2016 1 / 27

SLIDE 2

Featuring the three wise men (who never heard of this talk)

Loris, Pierre-Francois, Guochuan

SLIDE 3

Context

Cloud computing promises on-demand resources.
Different types of computing resources are available.
Arbitrary speedups are in principle possible.
The catch is that you have to pay for the resources used.
The problem becomes a tradeoff between the runtime of an application and the cost of executing it.

In this presentation, we show that the pipelined dataflow abstraction is good for expressing this tradeoff because runtime is “easy” to predict. We use a particular imaging application to exemplify the technique.

SLIDE 4

Feature Extraction from Histopathological Slides

Figure (processing steps): Biopsy slides, Non-blank patch, Preprocess, SuperPixel segmentation, LBP feature extraction

Slide sizes vary, on the order of 100k × 100k pixels.
Aperio format with thumbnail (about 1 GB per file, 24 GB uncompressed).
Available from a public repository (TCGA) with samples from thousands of participants.
3 slides per patient.
Can be used to predict whether the biopsy is cancerous.
We will consider two instances: twoparticipants (2 participants) and allslides (42 participants).

SLIDE 5

Outline

1. Introduction
2. Predicting Runtime of Pipelined Dataflow Application
3. A Flowshop Problem
4. Time-Cost Tradeoff
5. Conclusion

SLIDE 6

Pipelined workflow

Layout

Figure labels: Image analysis, Reader, Segmentation, LBP Features, Partitioner

Reader discards background tiles

Placement

Node 0: 1 CPU core
Node 1: 1 CPU core
Node 2: 4 CPU cores
Node 3: 4 CPU cores + 1 GPU

Advantages

Sequential processes
Heterogeneous
Replication for throughput
Comm/Comp overlap

Application

Medical imaging
Stock option pricing
Synthetic Aperture Radar
Incremental graph algorithm

SLIDE 8

How to predict runtime?

In a pipelined system, what matters is the steady state! The throughput is given by the most loaded node.

SLIDE 9

Runtime in a (simple) pipelined dataflow model

Model

An application of M stages
J identical jobs
Stage i processes a job in time $p_i$

One-to-one mapping

With one processor per stage, the execution is constrained by the slowest stage.
Period $P = \max_i p_i$
Throughput $T = 1/P$

Figure: Gantt chart of stages S1, S2, S3 over time, with $p_1 = 1$, $p_2 = 2$, $p_3 = 1.5$, so $P = p_2$. The time marks on the chart are $JP$ and $JP + (\sum_i p_i - P)$.
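The closed form on this slide can be checked against a direct simulation. The sketch below is a minimal Python illustration, assuming unbounded buffers between stages and that a stage starts a job as soon as both the job and the stage are free. The function names are illustrative and do not come from the talk.

```python
def pipeline_makespan(p, num_jobs):
    """Simulate num_jobs identical jobs through stages with times p[i]."""
    finish = [0.0] * len(p)  # finish[i]: time stage i completed its latest job
    for _ in range(num_jobs):
        ready = 0.0  # time the current job leaves the previous stage
        for i, p_i in enumerate(p):
            start = max(ready, finish[i])
            finish[i] = start + p_i
            ready = finish[i]
    return finish[-1]


def closed_form_makespan(p, num_jobs):
    """J*P + (sum_i p_i - P), with period P = max_i p_i."""
    period = max(p)
    return num_jobs * period + (sum(p) - period)


p = [1.0, 2.0, 1.5]  # p1, p2, p3 from the slide
period = max(p)
throughput = 1.0 / period
simulated = pipeline_makespan(p, 4)
predicted = closed_form_makespan(p, 4)
print(period, throughput, simulated, predicted)
```

For J = 4 both functions give 10.5, that is, 4 periods of length 2 plus the fill time of 2.5.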

SLIDE 12

Runtime in a (more complex) pipelined dataflow model

Replication

In some applications, stages can be replicated to increase the throughput.
If stage i is replicated $r_i$ times, it processes jobs at a rate $\tau_i = r_i / p_i$.
Throughput $T = \min_i \tau_i$ (the slowest stage is the bottleneck)
Period $P = 1/T$

Heterogeneity

In some applications, a stage can be replicated on different systems.
If stage i is replicated on a CPU and a GPU, it processes jobs at a rate $\tau_i = 1/p_i^{cpu} + 1/p_i^{gpu}$.
Throughput $T = \min_i \tau_i$ (the slowest stage is the bottleneck)
Period $P = 1/T$

These two techniques combine!
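Replication and heterogeneity both amount to adding up the rates of the processors assigned to a stage, so one helper covers both cases. The following Python sketch assumes the pipeline runs at the rate of its slowest stage, in line with the one-to-one model; the per-job times are hypothetical.

```python
def stage_rate(replica_times):
    """Rate of one stage: sum of 1/p over its replicas (CPU, GPU, ...)."""
    return sum(1.0 / p for p in replica_times)


def pipeline_throughput(stages):
    """Throughput of the pipeline: the rate of its slowest stage."""
    return min(stage_rate(replicas) for replicas in stages)


# Hypothetical per-job times (seconds); each inner list is one stage.
stages = [
    [0.5],            # stage 1: one processor, p = 0.5
    [2.0, 2.0],       # stage 2: replicated twice, r = 2, p = 2
    [4.0, 2.0, 4.0],  # stage 3: two CPU replicas (p = 4) plus one GPU (p = 2)
]
throughput = pipeline_throughput(stages)
period = 1.0 / throughput
print(throughput, period)
```

Here stage 1 runs at 2 jobs per second while stages 2 and 3 both run at 1 job per second, so the pipeline as a whole delivers 1 job per second.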

SLIDE 13

Experimental settings and model calibration

Machine

32-node cluster
Two Xeon E5520 (quad core)
An NVIDIA C2050
DDR4x Infiniband

Software

g++ 4.8.1
mvapich2 2.2
DataCutter (dcmpi)
Openslide 3.4.1
gSLIC
nvcc 7.0.27

Tile prediction

based on thumbnail

Figure: total and valid tile counts (y-axis: Tiles) per slide (x-axis: Slide Index).

Model Calibration

Slide                     Filesize    Width     Height   Estimated Total Tiles   Estimated Valid Tiles
TCGA-BH-A18V-01A-01-TSA   432.93 MB   98,631    33,244   225                     78
TCGA-BH-A18J-01A-01-TSA   322.01 MB   112,037   29,845   224                     75

ImAn:

CPU / GPU                    Proc. Time   Local τIA     Average τIA   Speedup
NVIDIA Tesla C2050           447.41 s     2.924M px/s   2.953M px/s   1
                             422.03 s     2.981M px/s
Intel Xeon E5520 (7 cores)   399.11 s     3.278M px/s   3.299M px/s   1.117
                             378.83 s     3.321M px/s
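The averages and the speedup in the ImAn table look consistent with a plain arithmetic mean of the two local rates and a ratio of the averages. That reading is an assumption, since the slide does not say how the columns were derived. A short Python check under that assumption:

```python
# Local ImAn rates from the table, in millions of pixels per second.
local_rates = {
    "NVIDIA Tesla C2050": [2.924, 2.981],
    "Intel Xeon E5520 (7 cores)": [3.278, 3.321],
}

# Assumption: "Average" is the arithmetic mean of the two local rates.
average = {dev: sum(r) / len(r) for dev, r in local_rates.items()}

# Assumption: "Speedup" is relative to the GPU's average rate.
baseline = average["NVIDIA Tesla C2050"]
speedup = {dev: avg / baseline for dev, avg in average.items()}

for dev in local_rates:
    print(f"{dev}: {average[dev]:.4f}M px/s, speedup {speedup[dev]:.4f}")
```

The recomputed values agree with the table to within rounding in the third decimal place.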

SLIDE 14

A first log

Figure: tile counts (Tiles) over walltime (s). Series: Total, Read, White, Valid, Analyzed, MRT, MAT.

1 Reader, 3 GPUs, two patients, natural ordering. (Eventually ImAn idles because too many White tiles are read.)

SLIDE 16

How to fix this?

The Valid tiles are more computationally expensive than the White ones. Valid first should work fine!

SLIDE 17

Valid First does not always work

Figure: tile counts (Tiles) over walltime (s). Series: Total, Read, White, Valid, Analyzed, MRT, MAT.

1 Reader, 2 GPUs, all slides, Valid first. (The system has bounded memory and eventually the Reader stalls.)

SLIDE 18

Outline

1. Introduction
2. Predicting Runtime of Pipelined Dataflow Application
3. A Flowshop Problem
4. Time-Cost Tradeoff
5. Conclusion

SLIDE 19

Flowshop

Deciding which job to process next is, in its simplest form, a flowshop problem.

Model

M stages
J jobs
Job j in stage i takes time $p_{i,j}$
Order the jobs to minimize the makespan

Bad News

NP-complete in this form
And that is actually an abstraction of the real problem
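To make the model concrete, here is a minimal Python sketch that evaluates the makespan of a given job order and finds the best order by exhaustive search. It assumes a permutation schedule (the same job order on every stage) and unbounded buffers between stages; the instance is hypothetical. Trying all J! orders is only viable for tiny inputs, which is exactly why the hardness result matters.

```python
from itertools import permutations


def flowshop_makespan(p, order):
    """Makespan of a permutation flowshop; p[i][j] is job j's time in stage i."""
    finish = [0.0] * len(p)  # finish[i]: completion time of the latest job in stage i
    for j in order:
        ready = 0.0  # time job j leaves the previous stage
        for i in range(len(p)):
            finish[i] = max(ready, finish[i]) + p[i][j]
            ready = finish[i]
    return finish[-1]


def best_order(p):
    """Exhaustive search over all J! orders; only viable for tiny instances."""
    jobs = range(len(p[0]))
    return min(permutations(jobs), key=lambda order: flowshop_makespan(p, order))


# Hypothetical 2-stage, 2-job instance.
p = [
    [3.0, 1.0],  # stage 0 times for job 0 and job 1
    [1.0, 3.0],  # stage 1 times for job 0 and job 1
]
order = best_order(p)
print(order, flowshop_makespan(p, order))
```

On this instance, running job 0 first gives a makespan of 7, while running job 1 first gives 5, so the order alone changes the runtime.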

SLIDE 21

How to make the problem computationally simpler?

Since you have categories of jobs, the $p_{i,j}$ matrix is actually low rank. That helped in $R||C_{\max}$. Maybe it helps here?

SLIDE 22

Interleave schedule

Insight

We have:
C categories of jobs
$J_c$ jobs in category c
The $J_c$ are large numbers
Sounds like something cyclic should work

Algorithm

Build k batches, each with $s_c = J_c / k$ jobs of category c

Asymptotic optimality

Each batch can be seen as a meta job in a one-to-one mapping. When k goes to infinity, the makespan of the flowshop problem converges to the optimal value of the pipelined scheduling problem. So with lots of jobs, performance is good.
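The batch construction can be sketched in a few lines of Python. The rounding used when $J_c$ is not divisible by k is an assumption made for this sketch, not something taken from the talk: batch b receives the integer share that keeps the running total as close as possible to $b \cdot J_c / k$. The category names and counts are hypothetical.

```python
def build_batches(jobs_per_category, k):
    """Split each category evenly across k batches (about J_c / k per batch)."""
    batches = [[] for _ in range(k)]
    for category, count in jobs_per_category.items():
        for b in range(k):
            # Integer share of batch b; the shares sum to exactly `count`.
            share = (b + 1) * count // k - b * count // k
            batches[b].extend([category] * share)
    return batches


def interleaved_order(jobs_per_category, k):
    """Concatenate the batches to obtain the processing order."""
    return [job for batch in build_batches(jobs_per_category, k) for job in batch]


# Hypothetical counts: 7 White tiles and 3 Valid tiles, k = 3 batches.
counts = {"White": 7, "Valid": 3}
batches = build_batches(counts, 3)
print(batches)
```

Every batch ends up with one Valid tile and two or three White tiles, so each batch reflects the overall mix of the workload.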

SLIDE 23

Dismissed Constraints

Divisibility

The number of jobs might be prime, but rational approximation works just fine.

Heterogeneity

Called the hybrid problem in the flowshop world. Heterogeneity just yields different $p_{i,j}$.

Onlineness

Non-clairvoyance can be solved with random ordering.

Low-Rank

Categories and low rank are slightly different (low rank admits linear combinations of categories). Low rank can be handled by some weighted interleave schedule.

Communication

Often modeled as an additional stage of processing.

Blocking Writes

As long as one batch does not saturate memory, pipelining will happen gracefully.
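The Onlineness point can be illustrated with a small experiment. When categories are unknown ahead of time, a random order tends to keep every prefix close to the overall category mix, which is what the interleave schedule achieves on purpose. The Python sketch below compares a Valid-first order with a shuffled one; the workload and the deviation measure are assumptions chosen for illustration.

```python
import random


def max_prefix_deviation(order, category):
    """Largest gap between the count of `category` in a prefix and its fair share."""
    fraction = order.count(category) / len(order)
    seen, worst = 0, 0.0
    for n, job in enumerate(order, start=1):
        seen += job == category
        worst = max(worst, abs(seen - fraction * n))
    return worst


# Hypothetical workload: 50 Valid and 50 White tiles.
valid_first = ["Valid"] * 50 + ["White"] * 50
shuffled = valid_first[:]
random.Random(0).shuffle(shuffled)  # fixed seed only to make the run repeatable

print(max_prefix_deviation(valid_first, "Valid"))
print(max_prefix_deviation(shuffled, "Valid"))
```

The Valid-first order reaches the worst possible deviation of 25 tiles after the first 50 jobs. A shuffled order can only match that if its first 50 jobs all belong to one category.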

SLIDE 24

In practice

Figure: tile counts (Tiles) over walltime (s). Series: Total, Read, White, Valid, Analyzed, MRT, MAT.

1 Reader, 2 GPUs, all slides, interleave. (All cases work just fine.)

SLIDE 25

Outline

1. Introduction
2. Predicting Runtime of Pipelined Dataflow Application
3. A Flowshop Problem
4. Time-Cost Tradeoff
5. Conclusion

SLIDE 26

Model

Time

Parameters:
W white tiles
V valid tiles
r CPU Readers, each at a rate $\tau^{CPU}_{Read}$
c CPU ImAn, each at a rate $\tau^{CPU}_{ImAn}$
g GPU ImAn, each at a rate $\tau^{GPU}_{ImAn}$

Prediction:
$\tau_{Read} = r\,\tau^{CPU}_{Read}$
$\tau_{ImAn} = c\,\tau^{CPU}_{ImAn} + g\,\tau^{GPU}_{ImAn}$
$C_{\max} = \max\left(\frac{W+V}{\tau_{Read}}, \frac{V}{\tau_{ImAn}}\right)$

Cost

Amazon EC2 charges per hour (MS Azure charges per minute).
So the charge is $\frac{C_{\max}}{3600}\,((r + c)\,C_c + g\,C_g)$

What you can get

In EC2, you can get a cg1 instance with 2 NVIDIA M2050 and 8 Xeon cores for $2.1 per hour. For the reader, you can use a c1.medium that gives a Xeon core for $0.13 per hour.
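The time and cost model translates directly into code. The Python sketch below follows the formulas on this slide. It assumes that $C_c$ and $C_g$ are hourly prices per CPU core and per GPU, and it adds an optional rounding to whole billing units, which is an assumption about how per-hour or per-minute billing behaves and is not part of the slide's formula. The workload, rates and prices are hypothetical.

```python
import math


def predict_cmax(W, V, r, c, g, tau_read_cpu, tau_iman_cpu, tau_iman_gpu):
    """Predicted runtime in seconds: the slower of reading and image analysis."""
    tau_read = r * tau_read_cpu
    tau_iman = c * tau_iman_cpu + g * tau_iman_gpu
    return max((W + V) / tau_read, V / tau_iman)


def charge(cmax, r, c, g, cost_core, cost_gpu, billing_seconds=None):
    """Charge for a run of cmax seconds, with hourly prices cost_core and cost_gpu.

    With billing_seconds=None the charge is proportional to cmax, as in the
    slide's formula. Otherwise cmax is first rounded up to a whole number of
    billing units (an assumption about how per-hour or per-minute billing works).
    """
    if billing_seconds is not None:
        cmax = math.ceil(cmax / billing_seconds) * billing_seconds
    return cmax / 3600 * ((r + c) * cost_core + g * cost_gpu)


# Hypothetical workload and rates (tiles per second), not measured values.
cmax = predict_cmax(W=600, V=400, r=2, c=0, g=2,
                    tau_read_cpu=5.0, tau_iman_cpu=1.0, tau_iman_gpu=2.0)
proportional = charge(cmax, r=2, c=0, g=2, cost_core=0.36, cost_gpu=3.6)
hourly = charge(cmax, r=2, c=0, g=2, cost_core=0.36, cost_gpu=3.6,
                billing_seconds=3600)
print(cmax, proportional, hourly)
```

With these numbers the run takes 100 seconds. The proportional charge is about 0.22, while rounding up to a full hour raises it to about 7.92, which shows how much billing granularity matters for short runs.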

SLIDE 28

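(1 + ε)-approximation of Time-Cost

Cost under time constraint

If you set a cap T on time (a target value for $C_{\max}$), then you obtain the bounds $\tau_{Read} \ge \frac{W+V}{C_{\max}}$ and $\tau_{ImAn} \ge \frac{V}{C_{\max}}$.

So: $r\,\tau^{c1}_{Read} \ge \frac{W+V}{C_{\max}} \implies r \ge \frac{W+V}{\tau^{c1}_{Read}\,C_{\max}}$

and $g\,\tau^{cg1}_{ImAn} \ge \frac{V}{C_{\max}} \implies g \ge \frac{V}{C_{\max}\,\tau^{cg1}_{ImAn}}$

Min cost: pick the smallest r and g.

Pareto approximation

Using the Papadimitriou and Yannakakis scheme.
Pick $T_{min}$ and $T_{max}$ and a basis $1 + \epsilon$.
Return the solution for $T = (1 + \epsilon)^k\,T_{min}$ for all $k \in \mathbb{N}$, $0 \le k \le \log_{1+\epsilon} \frac{T_{max}}{T_{min}}$.
That set is a $(1 + \epsilon)$-approximation of the Pareto set.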
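Both steps fit in a short Python sketch: the smallest integer r and g for a given time cap, then one candidate per cap on a geometric grid. The function names, the lower bound of one reader and one GPU, and the sample numbers are assumptions made for illustration. The grid is built by repeated multiplication, so floating-point rounding may drop a final cap that should equal $T_{max}$ exactly.

```python
import math


def cheapest_config(W, V, cap, tau_read_c1, tau_iman_cg1):
    """Smallest integer r and g whose predicted runtime fits within `cap` seconds."""
    r = max(1, math.ceil((W + V) / (tau_read_c1 * cap)))
    g = max(1, math.ceil(V / (cap * tau_iman_cg1)))
    return r, g


def geometric_caps(t_min, t_max, eps):
    """Time caps T = (1 + eps)^k * t_min for k = 0, 1, ... while T <= t_max."""
    caps, t = [], t_min
    while t <= t_max:
        caps.append(t)
        t *= 1 + eps
    return caps


def pareto_candidates(W, V, t_min, t_max, eps, tau_read_c1, tau_iman_cg1):
    """One (cap, r, g) candidate per geometric time cap."""
    return [(cap, *cheapest_config(W, V, cap, tau_read_c1, tau_iman_cg1))
            for cap in geometric_caps(t_min, t_max, eps)]


# Hypothetical workload and rates (tiles per second), with eps = 1.
points = pareto_candidates(W=600, V=400, t_min=100.0, t_max=800.0, eps=1.0,
                           tau_read_c1=5.0, tau_iman_cg1=2.0)
for cap, r, g in points:
    print(cap, r, g)
```

In this example, the caps of 200, 400 and 800 seconds all lead to the same configuration of one reader and one GPU, so only the 100-second cap needs more resources.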

SLIDE 29

Some Values

Figure: cost (in USD) against time (in seconds) for the two-patients instance. Series: per hour pricing, per minute pricing, approximation.

SLIDE 30

Outline

1. Introduction
2. Predicting Runtime of Pipelined Dataflow Application
3. A Flowshop Problem
4. Time-Cost Tradeoff
5. Conclusion

SLIDE 31

Conclusion

Predicting the runtime of pipelined dataflow applications is feasible

Simple bottleneck analysis should work
Just make sure there are no artificial bubbles in the execution
Integrates heterogeneous processors gracefully

Time-cost tradeoff in the cloud

Once you have a closed formula for the runtime, picking the cheapest machine to finish the application in a given time is easy
Finding an approximation of the Pareto curve is immediate

SLIDE 32

Future Works

Do low-rank matrices make flowshop easier?

Here it works because we have lots of jobs. Even in the hybrid case?

Dynamic pricing

Spot instances have prices that vary over time. Can we do a similar analysis with dynamic pricing?

Power and Energy

There is existing work on pipelined execution with energy objectives. Can we leverage it in practice?

SLIDE 33

Thank you

and thanks to the three wise men!

More information

Contact: esaule@uncc.edu
Visit: http://webpages.uncc.edu/~esaule
