

SLIDE 1

Scaling Deep Learning to 100s of GPUs on Hops Hadoop

Fabio Buso, Software Engineer, Logical Clocks AB

SLIDE 2

HopsFS: Next generation HDFS

37x the number of files

16x the throughput

Scale Challenge Winner (2017)

*https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi

**https://eurosys2017.github.io/assets/data/posters/poster09-Niazi.pdf

SLIDE 3

Hops platform

  • Projects, Datasets, Users
  • HopsFS, HopsYARN, MySQL NDB Cluster
  • Spark, TensorFlow, Hive, Kafka, Flink
  • Jupyter, Zeppelin, Jobs, Grafana, ELK
  • REST API

Version 0.3.0 just released!

SLIDE 4

Python first

  • Conda repo per project
  • Conda env: search, install/remove packages (e.g. Python 3.6, pandas, NumPy)
  • Environment usable by Spark/TensorFlow

Hops Python library: makes development easy

  • Hyperparameter search
  • Manages the TensorBoard lifecycle

SLIDE 5

Find big datasets - Dela*

  • Discover, share, and experiment with interesting datasets
  • P2P network of Hops clusters
  • ImageNet, YouTube8M, Reddit comments...
  • Exploits unused bandwidth

* http://ieeexplore.ieee.org/document/798225/ (ICDCS 2017)

SLIDE 6

Scale-out level 1: Parallel hyperparameter search

SLIDE 7

Parallel hyperparameter search

def model(lr, dropout):
    …

args_dict = {'learning_rate': [0.001, 0.0005, 0.0001],
             'dropout': [0.45, 0.7]}

# Expand the value lists into the full grid of combinations
args_dict_grid = util.grid_params(args_dict)

# Launch one experiment per grid point on the Spark executors
tflauncher.launch(spark, model, args_dict_grid)

Starts 6 parallel experiments (3 learning rates x 2 dropout values)
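For intuition, grid_params expands the per-parameter value lists into their Cartesian product, one experiment per combination. A minimal sketch of that expansion (a hypothetical reimplementation for illustration, not the hops library's actual code):

import itertools

def grid_params(args_dict):
    # Hypothetical sketch: expand {param: [values]} into aligned lists,
    # one entry per point in the Cartesian product of all value lists.
    names = sorted(args_dict)
    combos = list(itertools.product(*(args_dict[n] for n in names)))
    return {n: [combo[i] for combo in combos] for i, n in enumerate(names)}

grid = grid_params({'learning_rate': [0.001, 0.0005, 0.0001],
                    'dropout': [0.45, 0.7]})
len(grid['learning_rate'])   # -> 6 (3 learning rates x 2 dropout values)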

SLIDE 8

Scale-out level 2: Distributed training

SLIDE 9

TensorFlowOnSpark (TFoS) by Yahoo!

  • Distributed TensorFlow over Spark
  • Runs on top of a Hadoop cluster
  • PS/workers executed inside Spark executors
  • Uses Spark for resource allocation
    – Our version: exclusive GPU allocations
    – Parameter server(s) do not get GPU(s)
  • Manages TensorBoard

SLIDE 10

Run TFoS

def training_fun(argv, ctx):
    …
    TFNode.start_cluster_server()
    …

TFCluster.run(spark, training_fun, num_exec, num_ps, …)

Full conversion guide:
https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion-Guide
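For orientation, a training function typically branches on the executor's role, following the pattern in the TFoS conversion guide and examples. A sketch under that assumption (ctx.job_name and ctx.task_index come from TFoS; the TF 1.x device placement reflects the era's API, and the model code is elided):

from tensorflowonspark import TFNode
import tensorflow as tf

def training_fun(argv, ctx):
    # ctx describes this executor's role in the TensorFlow cluster
    cluster, server = TFNode.start_cluster_server(ctx)

    if ctx.job_name == "ps":
        server.join()   # parameter servers serve variables until shutdown
    elif ctx.job_name == "worker":
        # Pin variables to the PS tasks and ops to this worker
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % ctx.task_index,
                cluster=cluster)):
            pass        # build the model, loss, and train_op here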

SLIDE 11

Scale-out level: master of the dark arts (Horovod)

SLIDE 12

Parameter server architecture doesn't scale

From: https://github.com/uber/horovod
SLIDE 13

Horovod by Uber

  • Based on previous work by Baidu
  • Organizes workers in a ring
  • Gradient updates distributed using all-reduce (illustrated on the next slides)
  • Synchronous protocol
SLIDE 14

All-Reduce

Initial state: each GPU's gradient is split into three chunks.

GPU1: a          b          c
GPU2: a1         b1         c1
GPU3: a2         b2         c2

SLIDE 15

All-Reduce

Step 1 (reduce-scatter): each GPU sends one chunk to the next GPU in the ring, which adds it to its own.

GPU1: a          b          c+c2
GPU2: a+a1       b1         c1
GPU3: a2         b1+b2      c2

SLIDE 16

All-Reduce

Step 2 (reduce-scatter): after the second exchange, each GPU holds one fully reduced chunk.

GPU1: a          b+b1+b2    c+c2
GPU2: a+a1       b1         c+c1+c2
GPU3: a+a1+a2    b1+b2      c2

SLIDE 17

All-Reduce

(Same state as slide 16: the reduce-scatter phase is complete and the all-gather phase begins.)

SLIDE 18

All-Reduce

Step 3 (all-gather): the reduced chunks circulate around the ring, replacing stale copies.

GPU1: a+a1+a2    b+b1+b2    c+c2
GPU2: a+a1       b+b1+b2    c+c1+c2
GPU3: a+a1+a2    b1+b2      c+c1+c2

SLIDE 19

All-Reduce

Step 4 (all-gather): every GPU now holds the complete reduced gradient.

GPU1: a+a1+a2    b+b1+b2    c+c1+c2
GPU2: a+a1+a2    b+b1+b2    c+c1+c2
GPU3: a+a1+a2    b+b1+b2    c+c1+c2
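To make the ring concrete, here is a minimal pure-Python simulation of the two phases shown above. This is an illustrative sketch of the algorithm only; Horovod delegates the actual exchange to MPI/NCCL:

def ring_allreduce(gpus):
    # gpus[g][c] holds chunk c of GPU g's gradient (plain numbers here).
    n = len(gpus)
    # Phase 1: reduce-scatter. In step s, GPU g sends chunk (g - s) % n to
    # its ring neighbour, which accumulates it. After n-1 steps, GPU g owns
    # the fully reduced chunk (g + 1) % n.
    for s in range(n - 1):
        for g in range(n):
            c = (g - s) % n
            gpus[(g + 1) % n][c] += gpus[g][c]
    # Phase 2: all-gather. The reduced chunks circulate around the ring,
    # overwriting stale copies, until every GPU has every chunk.
    for s in range(n - 1):
        for g in range(n):
            c = (g + 1 - s) % n
            gpus[(g + 1) % n][c] = gpus[g][c]
    return gpus

# Three GPUs, three chunks each (a, b, c in the slides; numbers here)
print(ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
# -> [[111, 222, 333], [111, 222, 333], [111, 222, 333]]

Each GPU sends and receives exactly one chunk per step, so the per-GPU traffic is independent of the number of GPUs; that is why the ring scales where the parameter server bottlenecks.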

SLIDE 20

Hops AllReduce

import horovod.tensorflow as hvd

def conv_model(feature, target, mode):
    …

def main(_):
    hvd.init()
    …
    # Wrap the optimizer so gradient averaging uses all-reduce
    opt = hvd.DistributedOptimizer(opt)
    if hvd.local_rank() == 0:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …
    else:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …

from hops import allreduce
allreduce.launch(spark, 'hdfs:///Projects/…/all_reduce.ipynb')
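hvd.DistributedOptimizer wraps the original optimizer and averages gradients across all workers with all-reduce before applying them, while BroadcastGlobalVariablesHook(0) broadcasts rank 0's initial variables so every worker starts training from identical weights.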

SLIDE 21

Demo time!

SLIDE 22

Play with it → hops.io/?q=content/hopsworks-vagrant

Docs → hops.io
Star us! → github.com/hopshadoop
Follow us! → @hopshadoop