SLIDE 1
Scaling Deep Learning to 100s of GPUs on Hops Hadoop
Fabio Buso, Software Engineer, Logical Clocks AB
SLIDE 2
HopsFS: Next-generation HDFS
37x the number of files*
16x the throughput**
Scale Challenge Winner (2017)
*https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
**https://eurosys2017.github.io/assets/data/posters/poster09-Niazi.pdf
SLIDE 3
Hops platform
- Projects, Datasets, Users
- HopsFS, HopsYARN, MySQL NDB Cluster
- Spark, TensorFlow, Hive, Kafka, Flink
- Jupyter, Zeppelin, Jobs, Grafana, ELK
- REST API
Version 0.3.0 just released!
SLIDE 4
Python first
- Conda repo; per-project Conda environment
- Search, install, and remove packages (Python 3.6, pandas, NumPy)
- Environment usable by Spark/TensorFlow
Hops Python library: makes development easy
- Hyperparameter search
- Manage TensorBoard lifecycle
SLIDE 5
Find big datasets - Dela*
- Discover, share, and experiment with interesting datasets
- P2P network of Hops clusters
- ImageNet, YouTube8M, Reddit comments...
- Exploits unused bandwidth
* http://ieeexplore.ieee.org/document/798225/ (ICDCS 2017)
SLIDE 6
Scale-out level 1: Parallel hyperparameter search
SLIDE 7
Parallel Hyperparameter searching
def model(lr, dropout):
    ...

args_dict = {'learning_rate': [0.001, 0.0005, 0.0001],
             'dropout': [0.45, 0.7]}
args_dict_grid = util.grid_params(args_dict)
tflauncher.launch(spark, model, args_dict_grid)

Starts 6 parallel experiments (3 learning rates x 2 dropout values)
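The hops `util.grid_params` helper isn't defined on the slide; a minimal sketch of what such a grid expansion might do, assuming it takes the cartesian product of the value lists, is:

```python
from itertools import product

def grid_params(args_dict):
    # Expand a dict of value lists into one dict per grid point
    # (cartesian product of all hyperparameter values).
    keys = sorted(args_dict)
    return [dict(zip(keys, combo))
            for combo in product(*(args_dict[k] for k in keys))]

args_dict = {'learning_rate': [0.001, 0.0005, 0.0001],
             'dropout': [0.45, 0.7]}
grid = grid_params(args_dict)
print(len(grid))  # 6 -> one experiment per grid point
```

Each dict in `grid` is then one self-contained argument set, which is what lets the launcher run the experiments in parallel, one per Spark executor.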
SLIDE 8
Scale-out level 2: Distributed training
SLIDE 9
TensorFlowOnSpark (TFoS) by Yahoo!
- TensorFlow over Spark
- Runs on top of a Hadoop cluster
- PS/Workers executed inside Spark executors
- Uses Spark for resource allocation
  – Our version: exclusive GPU allocations
  – Parameter server(s) do not get GPU(s)
- TensorBoard
SLIDE 10
Run TFoS
def training_fun(argv, ctx):
    ...
    TFNode.start_cluster_server()
    ...

TFCluster.run(spark, training_fun, num_exec, num_ps, ...)

Full conversion guide:
https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion-Guide
SLIDE 11
Scale-out level: Master of the dark arts - Horovod
SLIDE 12
PS server architecture doesn’t scale
From: https://github.com/uber/horovod
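A rough bandwidth model makes the bottleneck concrete (the worker count and gradient size below are hypothetical, not from the slides): a single parameter server must carry every worker's gradients in and parameters out each step, so its link traffic grows linearly with the number of workers, while in a ring each node moves a near-constant 2·(N−1)/N times the gradient size.

```python
def ps_server_bytes(n_workers, grad_bytes):
    # Single parameter server: receives all workers' gradients and
    # sends updated parameters back -> traffic grows with N.
    return 2 * n_workers * grad_bytes

def ring_node_bytes(n_workers, grad_bytes):
    # Ring all-reduce: each node sends and receives
    # 2 * (N - 1) / N of the gradient size -> roughly constant in N.
    return 2 * grad_bytes * (n_workers - 1) / n_workers

# 100 workers, 100 MB of gradients per step (hypothetical numbers):
print(ps_server_bytes(100, 100e6))   # 2e10 -> 20 GB through one link
print(ring_node_bytes(100, 100e6))   # 1.98e8 -> ~200 MB per node
```

Sharding parameters across several PS instances spreads this load, but picking the right number of shards is exactly the kind of tuning the ring topology avoids.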
SLIDE 13
Horovod by Uber
- Based on previous work done by Baidu
- Organizes workers in a ring
- Gradient updates distributed using All-Reduce
SLIDE 14
All-Reduce

GPU 1:  a         b         c
GPU 2:  a1        b1        c1
GPU 3:  a2        b2        c2
SLIDE 15
All-Reduce

GPU 1:  a         b         c+c2
GPU 2:  a+a1      b1        c1
GPU 3:  a2        b1+b2     c2
SLIDE 16
All-Reduce

GPU 1:  a         b+b1+b2   c+c2
GPU 2:  a+a1      b1        c+c1+c2
GPU 3:  a+a1+a2   b1+b2     c2
SLIDE 17
All-Reduce

GPU 1:  a         b+b1+b2   c+c2
GPU 2:  a+a1      b1        c+c1+c2
GPU 3:  a+a1+a2   b1+b2     c2

(each GPU now holds one fully reduced chunk; the gather phase begins)
SLIDE 18
All-Reduce

GPU 1:  a+a1+a2   b+b1+b2   c+c2
GPU 2:  a+a1      b+b1+b2   c+c1+c2
GPU 3:  a+a1+a2   b1+b2     c+c1+c2
SLIDE 19
All-Reduce

GPU 1:  a+a1+a2   b+b1+b2   c+c1+c2
GPU 2:  a+a1+a2   b+b1+b2   c+c1+c2
GPU 3:  a+a1+a2   b+b1+b2   c+c1+c2
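The walkthrough above can be simulated in plain Python. This is a sketch of the ring algorithm only (Horovod itself runs it over MPI/NCCL): a reduce-scatter phase in which running partial sums travel around the ring, followed by an all-gather phase that circulates the completed chunks.

```python
def ring_allreduce(gpus):
    # gpus: one list of numeric chunks per "GPU". Returns the final
    # state, where every GPU holds the elementwise sum of all inputs.
    n = len(gpus)
    data = [list(g) for g in gpus]

    # Reduce-scatter: in step s, GPU i sends its running sum of
    # chunk (i - s) mod n to its ring neighbour, which adds it in.
    for step in range(n - 1):
        sends = [((i - step) % n, data[i][(i - step) % n])
                 for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            data[(i + 1) % n][chunk] += value

    # All-gather: GPU i now owns the complete sum of chunk (i + 1)
    # mod n; circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, data[i][(i + 1 - step) % n])
                 for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            data[(i + 1) % n][chunk] = value
    return data

# Three GPUs with three chunks each, mirroring the slides:
print(ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
# [[111, 222, 333], [111, 222, 333], [111, 222, 333]]
```

In each of the 2·(N−1) steps every node sends exactly one chunk (1/N of the data) to one neighbour, which is why per-node traffic stays flat as the ring grows.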
SLIDE 20
Hops AllReduce
import horovod.tensorflow as hvd

def conv_model(feature, target, mode):
    ...

def main(_):
    hvd.init()
    ...
    opt = hvd.DistributedOptimizer(opt)
    if hvd.local_rank() == 0:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ...]
        ...
    else:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ...]
        ...

from hops import allreduce
allreduce.launch(spark, 'hdfs:///Projects/…/all_reduce.ipynb')
SLIDE 21
Demo time!
SLIDE 22
Play with it → hops.io/?q=content/hopsworks-vagrant
Docs → hops.io
Star us! → github.com/hopshadoop
Follow us! → @hopshadoop