SLIDE 1
Scaling Deep Learning to 100s of GPUs on Hops Hadoop
Fabio Buso, Software Engineer, Logical Clocks AB
SLIDE 2
HopsFS: Next-generation HDFS
37x the number of files*
16x the throughput**
Scale Challenge Winner (2017)
*https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
**https://eurosys2017.github.io/assets/data/posters/poster09-Niazi.pdf
SLIDE 3
Hops platform
- Projects, Datasets, Users
- HopsFS, HopsYARN, MySQL NDB Cluster
- Spark, TensorFlow, Hive, Kafka, Flink
- Jupyter, Zeppelin, Jobs, Grafana, ELK
- REST API
Version 0.3.0 just released!
SLIDE 4
Python first
- Conda repo; per-project Conda environment
- Search, install, and remove packages (Python 3.6, pandas, NumPy)
- Environment usable by Spark/TensorFlow
Hops Python library: makes development easy
- Hyperparameter search
- Manage TensorBoard lifecycle
SLIDE 5
Find big datasets - Dela*
- Discover, share, and experiment with interesting datasets
- P2P network of Hops clusters
- ImageNet, YouTube8M, Reddit comments...
- Exploits unused bandwidth
* http://ieeexplore.ieee.org/document/798225/ (ICDCS 2017)
SLIDE 6
Scale-out level 1: Parallel hyperparameter search
SLIDE 7
Parallel Hyperparameter searching
def model(lr, dropout):
    ...

args_dict = {'learning_rate': [0.001, 0.0005, 0.0001],
             'dropout': [0.45, 0.7]}
args_dict_grid = util.grid_params(args_dict)
tflauncher.launch(spark, model, args_dict_grid)

Starts 6 parallel experiments (3 learning rates x 2 dropout values)
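The hops `util.grid_params` helper isn't defined on the slide; a minimal sketch of what such a grid expansion might do, assuming it takes the cartesian product of the value lists, is:

```python
from itertools import product

def grid_params(args_dict):
    # Expand a dict of value lists into one dict per grid point
    # (cartesian product of all hyperparameter values).
    keys = sorted(args_dict)
    return [dict(zip(keys, combo))
            for combo in product(*(args_dict[k] for k in keys))]

args_dict = {'learning_rate': [0.001, 0.0005, 0.0001],
             'dropout': [0.45, 0.7]}
grid = grid_params(args_dict)
print(len(grid))  # 6 -> one experiment per grid point
```

Each dict in `grid` is then one self-contained argument set, which is what lets the launcher run the experiments in parallel, one per Spark executor.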
SLIDE 8
Scale-out level 2: Distributed training
SLIDE 9
TensorFlowOnSpark (TFoS) by Yahoo!
- TensorFlow over Spark
- Runs on top of a Hadoop cluster
- PS/Workers executed inside Spark executors
- Uses Spark for resource allocation
  – Our version: exclusive GPU allocations
  – Parameter server(s) do not get GPU(s)
- TensorBoard
SLIDE 10
Run TFoS
def training_fun(argv, ctx):
    ...
    TFNode.start_cluster_server()
    ...

TFCluster.run(spark, training_fun, num_exec, num_ps, ...)

Full conversion guide:
https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion-Guide
SLIDE 11
Scale-out level: Master of the dark arts - Horovod
SLIDE 12
PS server architecture doesn’t scale
From: https://github.com/uber/horovod
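A rough bandwidth model makes the bottleneck concrete (the worker count and gradient size below are hypothetical, not from the slides): a single parameter server must carry every worker's gradients in and parameters out each step, so its link traffic grows linearly with the number of workers, while in a ring each node moves a near-constant 2·(N−1)/N times the gradient size.

```python
def ps_server_bytes(n_workers, grad_bytes):
    # Single parameter server: receives all workers' gradients and
    # sends updated parameters back -> traffic grows with N.
    return 2 * n_workers * grad_bytes

def ring_node_bytes(n_workers, grad_bytes):
    # Ring all-reduce: each node sends and receives
    # 2 * (N - 1) / N of the gradient size -> roughly constant in N.
    return 2 * grad_bytes * (n_workers - 1) / n_workers

# 100 workers, 100 MB of gradients per step (hypothetical numbers):
print(ps_server_bytes(100, 100e6))   # 2e10 -> 20 GB through one link
print(ring_node_bytes(100, 100e6))   # 1.98e8 -> ~200 MB per node
```

Sharding parameters across several PS instances spreads this load, but picking the right number of shards is exactly the kind of tuning the ring topology avoids.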
SLIDE 13
Horovod by Uber
- Based on previous work done by Baidu
- Organizes workers in a ring
- Gradient updates distributed using All-Reduce
SLIDE 14
All-Reduce

GPU 1:  a         b         c
GPU 2:  a1        b1        c1
GPU 3:  a2        b2        c2
SLIDE 15
All-Reduce

GPU 1:  a         b         c+c2
GPU 2:  a+a1      b1        c1
GPU 3:  a2        b1+b2     c2
SLIDE 16
All-Reduce

GPU 1:  a         b+b1+b2   c+c2
GPU 2:  a+a1      b1        c+c1+c2
GPU 3:  a+a1+a2   b1+b2     c2
SLIDE 17
All-Reduce

GPU 1:  a         b+b1+b2   c+c2
GPU 2:  a+a1      b1        c+c1+c2
GPU 3:  a+a1+a2   b1+b2     c2

(each GPU now holds one fully reduced chunk; the gather phase begins)
SLIDE 18
All-Reduce

GPU 1:  a+a1+a2   b+b1+b2   c+c2
GPU 2:  a+a1      b+b1+b2   c+c1+c2
GPU 3:  a+a1+a2   b1+b2     c+c1+c2
SLIDE 19
All-Reduce

GPU 1:  a+a1+a2   b+b1+b2   c+c1+c2
GPU 2:  a+a1+a2   b+b1+b2   c+c1+c2
GPU 3:  a+a1+a2   b+b1+b2   c+c1+c2
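The walkthrough above can be simulated in plain Python. This is a sketch of the ring algorithm only (Horovod itself runs it over MPI/NCCL): a reduce-scatter phase in which running partial sums travel around the ring, followed by an all-gather phase that circulates the completed chunks.

```python
def ring_allreduce(gpus):
    # gpus: one list of numeric chunks per "GPU". Returns the final
    # state, where every GPU holds the elementwise sum of all inputs.
    n = len(gpus)
    data = [list(g) for g in gpus]

    # Reduce-scatter: in step s, GPU i sends its running sum of
    # chunk (i - s) mod n to its ring neighbour, which adds it in.
    for step in range(n - 1):
        sends = [((i - step) % n, data[i][(i - step) % n])
                 for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            data[(i + 1) % n][chunk] += value

    # All-gather: GPU i now owns the complete sum of chunk (i + 1)
    # mod n; circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, data[i][(i + 1 - step) % n])
                 for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            data[(i + 1) % n][chunk] = value
    return data

# Three GPUs with three chunks each, mirroring the slides:
print(ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
# [[111, 222, 333], [111, 222, 333], [111, 222, 333]]
```

In each of the 2·(N−1) steps every node sends exactly one chunk (1/N of the data) to one neighbour, which is why per-node traffic stays flat as the ring grows.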
SLIDE 20
Hops AllReduce
import horovod.tensorflow as hvd

def conv_model(feature, target, mode):
    ...

def main(_):
    hvd.init()
    ...
    opt = hvd.DistributedOptimizer(opt)
    if hvd.local_rank() == 0:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ...]
        ...
    else:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ...]
        ...

from hops import allreduce
allreduce.launch(spark, 'hdfs:///Projects/…/all_reduce.ipynb')
SLIDE 21
Demo time!
SLIDE 22
Play with it → hops.io/?q=content/hopsworks-vagrant
Docs → hops.io
Star us! → github.com/hopshadoop
Follow us! → @hopshadoop