Distributed Deep Learning Using Hopsworks, SF Machine Learning (PowerPoint PPT Presentation)



SLIDE 1

Distributed Deep Learning Using Hopsworks

SF Machine Learning @ Mesosphere. Kim Hammar, kim@logicalclocks.com

SLIDE 2

DISTRIBUTED COMPUTING + DEEP LEARNING = ?

[Figure: distributed computing (left) and a neural network (right)]

Why Combine the two?


SLIDE 3

DISTRIBUTED COMPUTING + DEEP LEARNING = ?

[Figure: distributed computing (left) and a neural network (right)]

Why Combine the two?

◮ We like challenging problems


SLIDE 4

DISTRIBUTED COMPUTING + DEEP LEARNING = ?

[Figure: distributed computing (left) and a neural network (right)]

Why Combine the two?

◮ We like challenging problems
◮ More productive data science
◮ Unreasonable effectiveness of data1
◮ To achieve state-of-the-art results2

1. Chen Sun et al. “Revisiting Unreasonable Effectiveness of Data in Deep Learning Era”. In: CoRR abs/1707.02968 (2017). arXiv: 1707.02968. URL: http://arxiv.org/abs/1707.02968.
2. Jeffrey Dean et al. “Large Scale Distributed Deep Networks”. In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira et al. Curran Associates, Inc., 2012, pp. 1223–1231.

SLIDE 5

DISTRIBUTED DEEP LEARNING (DDL): PREDICTABLE SCALING

[Figure: scaling results from Jeff Dean’s lecture3]

3. Jeff Dean. Building Intelligent Systems with Large Scale Deep Learning. https://www.scribd.com/document/355752799/Jeff-Dean-s-Lecture-for-YC-AI. 2018.


SLIDE 7

DDL IS NOT A SECRET ANYMORE

[Figure: from Ben-Nun and Hoefler’s survey of parallel and distributed deep learning4]

4. Tal Ben-Nun and Torsten Hoefler. “Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis”. In: CoRR abs/1802.09941 (2018). arXiv: 1802.09941. URL: http://arxiv.org/abs/1802.09941.

SLIDE 8

DDL IS NOT A SECRET ANYMORE

Frameworks for DDL: TensorflowOnSpark, CaffeOnSpark, Distributed TF

[Figure: logos of companies using DDL]

SLIDE 9

DDL REQUIRES AN ENTIRE SOFTWARE/INFRASTRUCTURE STACK

[Figure: distributed training (executors e1 to e4, each computing a gradient ∇) is just one component of the stack]

Surrounding components: Distributed Systems, Data Validation, Feature Engineering, Data Collection, Hardware Management, HyperParameter Tuning, Model Serving, Pipeline Management, A/B Testing, Monitoring

SLIDE 10

OUTLINE

1. Hopsworks: Background of the platform
2. Managed Distributed Deep Learning using HopsYARN, HopsML, PySpark, and Tensorflow
3. Black-Box Optimization using Hopsworks, Metadata Store, PySpark, and Maggy5

5. Moritz Meister and Sina Sheikholeslami. Maggy. https://github.com/logicalclocks/maggy. 2019.


SLIDE 12

HOPSWORKS

SLIDE 13

HOPSWORKS

HopsFS

SLIDE 14

HOPSWORKS

HopsFS, HopsYARN (GPU/CPU as a resource)

SLIDE 15

HOPSWORKS

HopsFS, HopsYARN (GPU/CPU as a resource), Frameworks (ML/Data)

SLIDE 16

HOPSWORKS

HopsFS, HopsYARN (GPU/CPU as a resource), Frameworks (ML/Data)

ML/AI Assets: Feature Store, Pipelines, Experiments, Models

[Figure: neural network diagram]

SLIDE 17

HOPSWORKS

HopsFS, HopsYARN (GPU/CPU as a resource), Frameworks (ML/Data)

ML/AI Assets: Feature Store, Pipelines, Experiments, Models

[Figure: neural network diagram]

from hops import featurestore
from hops import experiment

features = featurestore.get_features([
    "average_attendance", "average_player_age"])
experiment.collective_all_reduce(features, model)

APIs

SLIDE 18

HOPSWORKS

HopsFS, HopsYARN (GPU/CPU as a resource), Frameworks (ML/Data)

Distributed Metadata (available from a REST API)

ML/AI Assets: Feature Store, Pipelines, Experiments, Models

[Figure: neural network diagram]

from hops import featurestore
from hops import experiment

features = featurestore.get_features([
    "average_attendance", "average_player_age"])
experiment.collective_all_reduce(features, model)

APIs
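A minimal sketch of how the two APIs above might compose into one experiment. Whether get_features is called on the driver or inside the training function, and its exact return type, are assumptions here, not quoted from the docs:

from hops import featurestore
from hops import experiment

def train_fn():
    # Fetch the joined features from the feature store
    # (assumed to return a dataframe) and train on them.
    features = featurestore.get_features([
        "average_attendance",
        "average_player_age",
    ])
    # ... build and fit a model on `features` ...

experiment.collective_all_reduce(train_fn)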

SLIDE 19

INNER AND OUTER LOOP OF LARGE SCALE DEEP LEARNING

Inner loop:

[Figure: workers 1..N each train a model replica on a data partition and synchronize gradients ∇1, ..., ∇N]
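The synchronization step corresponds to a standard synchronous data-parallel SGD update (a textbook formulation, not specific to Hopsworks): each worker computes a gradient on its own partition, and the averaged gradient is applied to the shared parameters:

\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla_i,
\qquad \nabla_i = \nabla_\theta L(\theta_t;\, \text{partition}_i)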

SLIDE 20

INNER AND OUTER LOOP OF LARGE SCALE DEEP LEARNING

Outer loop: a search method proposes hyperparameters h and receives back a metric τ.

Inner loop:

[Figure: workers 1..N each train a model replica on a data partition and synchronize gradients ∇1, ..., ∇N]


SLIDE 22

INNER LOOP: DISTRIBUTED DEEP LEARNING

[Figure: four executors e1 to e4]

SLIDE 23

INNER LOOP: DISTRIBUTED DEEP LEARNING

[Figure: executors e1 to e4, each computing a gradient ∇]

SLIDE 24

INNER LOOP: DISTRIBUTED DEEP LEARNING

[Figure: executors e1 to e4, each computing a gradient ∇ on its own data partition p1 to p4]

SLIDE 25

DISTRIBUTED DEEP LEARNING IN PRACTICE

◮ Implementation of distributed algorithms is becoming a commodity (TF, PyTorch, etc.)
◮ The hardest part of DDL is now:
  ◮ Cluster management
  ◮ Allocating GPUs
  ◮ Data management
  ◮ Operations & performance

[Figure: how do models, GPUs, data, and distribution fit together?]

SLIDE 26

HOPSWORKS DDL SOLUTION

SLIDE 27

HOPSWORKS DDL SOLUTION

from hops import experiment

experiment.collective_all_reduce(train_fn)
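A sketch of what a train_fn might look like. The wrapper runs the function on every executor; the model below is a placeholder, and the exact contract of collective_all_reduce (cluster setup, return values) is assumed rather than quoted from the docs:

import tensorflow as tf

def train_fn():
    # Runs on each executor; the platform is assumed to wire up
    # the cluster spec and the all-reduce strategy.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    # Dataset loading (e.g. sharded training data from HopsFS)
    # is elided in this sketch.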

SLIDE 28

HOPSWORKS DDL SOLUTION

from hops import experiment

experiment.collective_all_reduce(train_fn)

[Figure: the client API sends resource requests to the HopsYARN RM, which allocates YARN containers with GPU as a resource]

SLIDE 29

HOPSWORKS DDL SOLUTION

from hops import experiment

experiment.collective_all_reduce(train_fn)

[Figure: a Spark driver runs in a YARN container; the HopsYARN RM allocates YARN containers (GPU as a resource), each running a Spark executor]

SLIDE 30

HOPSWORKS DDL SOLUTION

from hops import experiment

experiment.collective_all_reduce(train_fn)

[Figure: as before, now with a conda environment inside each container (driver and executors)]

SLIDE 31

HOPSWORKS DDL SOLUTION

from hops import experiment

experiment.collective_all_reduce(train_fn)

[Figure: executors register their addresses with the driver: “Here is my ip: 192.168.1.1” ... “Here is my ip: 192.168.1.4”]

SLIDE 32

HOPSWORKS DDL SOLUTION

from hops import experiment

experiment.collective_all_reduce(train_fn)

[Figure: executors compute gradients ∇ and read from and write to the Hops Distributed File System (HopsFS)]

SLIDE 33

HOPSWORKS DDL SOLUTION

from hops import experiment

experiment.collective_all_reduce(train_fn)

[Figure: executors compute gradients ∇ and read from and write to the Hops Distributed File System (HopsFS)]

◮ Hide complexity behind a simple API
◮ Allocate resources using PySpark
◮ Allocate GPUs for Spark executors using HopsYARN
◮ Serve sharded training data to workers from HopsFS
◮ Use HopsFS to aggregate logs, checkpoints, and results
◮ Store experiment metadata in the metastore
◮ Use dynamic allocation for interactive resource management (see the sketch below)
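On the last point, dynamic allocation is plain Spark configuration; a sketch with standard Spark property names (the values are arbitrary examples, not Hopsworks defaults):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hopsml-experiment")
         # Let Spark grow and shrink the executor pool as the
         # interactive session needs resources.
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "0")
         .config("spark.dynamicAllocation.maxExecutors", "4")
         .getOrCreate())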

SLIDE 34

OUTER LOOP: BLACK BOX OPTIMIZATION

Outer loop: a search method proposes hyperparameters h and receives back a metric τ.

Inner loop:

[Figure: workers 1..N each train a model replica on a data partition and synchronize gradients ∇1, ..., ∇N]


SLIDE 36

OUTER LOOP: BLACK BOX OPTIMIZATION

[Figure: features (x1, ..., xn) and hyperparameters (η, num_layers, neurons) feed a model θ, which outputs a prediction ŷ; the loss L(y, ŷ) yields the gradient ∇θ L(y, ŷ)]
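Together, the two loops form a bi-level optimization problem (a standard formulation): the inner loop fits the parameters θ for fixed hyperparameters h, and the outer loop searches over h using only the returned metric τ:

h^* = \arg\min_{h \in \mathcal{H}} \tau\big(\theta^*(h)\big),
\qquad \theta^*(h) = \arg\min_{\theta} L\big(y, \hat{y}(\theta; h)\big)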

SLIDE 37

OUTER LOOP: BLACK BOX OPTIMIZATION

Example use case from one of our clients:

◮ Goal: Train a One-Class GAN model for fraud detection
◮ Problem: GANs are extremely sensitive to hyperparameters, and there exists a very large space of possible hyperparameters
◮ Example hyperparameters to tune: learning rates η, optimizers, layers, etc.

[Figure: GAN: random noise z feeds a generator network; real inputs x and generated samples feed a discriminator network]
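For a use case like this, the search space could be declared roughly as below, using the Maggy Searchspace API introduced later in the deck; the parameter names and ranges are hypothetical, not the client's actual configuration:

from maggy.searchspace import Searchspace

# Hypothetical GAN tuning ranges (illustration only).
sp = Searchspace(
    generator_lr=('DOUBLE', [1e-5, 1e-2]),
    discriminator_lr=('DOUBLE', [1e-5, 1e-2]),
    num_layers=('INTEGER', [2, 12]),
)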

SLIDE 38

OUTER LOOP: BLACK BOX OPTIMIZATION

[Figure: the search space as a 3D grid: neurons per layer (25 to 45), number of layers (2 to 12), learning rate (0.0 to 0.1)]

[Figure: features (x1, ..., xn) and hyperparameters (η, num_layers, neurons) feed a model θ, which outputs a prediction ŷ; the loss L(y, ŷ) yields the gradient ∇θ L(y, ŷ)]

SLIDE 39

OUTER LOOP: BLACK BOX OPTIMIZATION

[Figure: the search space as a 3D grid: neurons per layer (25 to 45), number of layers (2 to 12), learning rate (0.0 to 0.1)]

[Figure: a shared task queue of hyperparameter configurations (η1, ...; η2, ...; ...) consumed by parallel workers]

[Figure: features (x1, ..., xn) and hyperparameters (η, num_layers, neurons) feed a model θ, which outputs a prediction ŷ; the loss L(y, ŷ) yields the gradient ∇θ L(y, ŷ)]

SLIDE 42

OUTER LOOP: BLACK BOX OPTIMIZATION

[Figure: search space, shared task queue, parallel workers, and training loop as on the previous slide]

Which algorithm to use for search?

SLIDE 43

OUTER LOOP: BLACK BOX OPTIMIZATION

[Figure: search space, shared task queue, parallel workers, and training loop as before]

Which algorithm to use for search?
How to monitor progress?

SLIDE 44

OUTER LOOP: BLACK BOX OPTIMIZATION

[Figure: search space, shared task queue, parallel workers, and training loop as before]

Which algorithm to use for search?
How to monitor progress?
How to aggregate results?

SLIDE 45

OUTER LOOP: BLACK BOX OPTIMIZATION

[Figure: search space, shared task queue, parallel workers, and training loop as before]

Which algorithm to use for search?
How to monitor progress?
How to aggregate results?
Fault tolerance?

SLIDE 46

OUTER LOOP: BLACK BOX OPTIMIZATION

[Figure: search space, shared task queue, parallel workers, and training loop as before]

Which algorithm to use for search?
How to monitor progress?
How to aggregate results?
Fault tolerance?

This should be managed with platform support!

SLIDE 47

MAGGY: A FRAMEWORK FOR SYNCHRONOUS/ASYNCHRONOUS HYPERPARAMETER TUNING ON HOPSWORKS7

A flexible framework for running different black-box optimization algorithms on Hopsworks:

◮ ASHA, Hyperband, Differential Evolution, Random search, Grid search, etc.

7. Authors of Maggy: Moritz Meister and Sina Sheikholeslami. Author of the base framework that Maggy builds on: Robin Andersson.


SLIDE 49

FRAMEWORK SUPPORT FOR SYNCHRONOUS SEARCH ALGORITHMS

[Figure: un-directed search: N Spark tasks produce N evaluation metrics, and the driver/parameter server takes the max/min. Synchronous directed search (2 iterations): N Spark tasks per iteration, separated by synchronization barriers, each round reporting N evaluation metrics to the driver/parameter server]

◮ Parallel un-directed/synchronous search is trivial using Spark and a distributed file system
◮ Examples of un-directed search algorithms: random and grid search
◮ Example of a synchronous search algorithm: differential evolution

SLIDE 50

FRAMEWORK SUPPORT FOR SYNCHRONOUS SEARCH ALGORITHMS

Fits very well with Spark's BSP model.

[Figure: the same un-directed and synchronous directed search diagrams as on the previous slide]

◮ Parallel un-directed/synchronous search is trivial using Spark and a distributed file system (see the sketch below)
◮ Examples of un-directed search algorithms: random and grid search
◮ Example of a synchronous search algorithm: differential evolution
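A minimal sketch of un-directed (random) search with plain PySpark, matching the left-hand diagram: sample N configurations, evaluate them as N parallel tasks, and take the max after the barrier. It assumes an existing SparkContext `sc`; `evaluate` stands in for a full training run and returns a dummy metric so the sketch is self-contained:

import random

def evaluate(config):
    lr, num_layers = config
    # Stand-in for training a model and returning its
    # validation metric (dummy value for illustration).
    return -abs(lr - 0.001) - 0.01 * num_layers

configs = [(10 ** random.uniform(-5, -1), random.randint(2, 12))
           for _ in range(20)]
# One Spark task per configuration; collect() blocks until all
# tasks finish -- the synchronization barrier in the diagram.
results = (sc.parallelize(configs, len(configs))
             .map(lambda c: (c, evaluate(c)))
             .collect())
best_config, best_metric = max(results, key=lambda r: r[1])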

SLIDE 51

PROBLEM WITH THE BULK-SYNCHRONOUS PROCESSING MODEL FOR PARALLEL SEARCH

[Figure: N Spark tasks behind a synchronization barrier; fast tasks wait for stragglers, wasting compute, before the driver/parameter server collects the N evaluation metrics]

◮ Synchronous search is sensitive to stragglers and not suitable for early stopping
◮ For large-scale search problems we need asynchronous search
◮ Problem: asynchronous search is much harder to implement with big-data processing tools such as Spark

SLIDE 52

ENTER MAGGY: A FRAMEWORK FOR RUNNING ASYNCHRONOUS SEARCH ALGORITHMS ON HOPS

[Figure: one Spark task per worker; HopsFS; an async task queue on the driver/parameter server]

SLIDE 53

ENTER MAGGY: A FRAMEWORK FOR RUNNING ASYNCHRONOUS SEARCH ALGORITHMS ON HOPS

[Figure: one Spark task per worker, with many async tasks running inside each; HopsFS; async task queue on the driver/parameter server]

SLIDE 54

ENTER MAGGY: A FRAMEWORK FOR RUNNING ASYNCHRONOUS SEARCH ALGORITHMS ON HOPS

[Figure: as before; workers write checkpoints & results to HopsFS]

SLIDE 55

ENTER MAGGY: A FRAMEWORK FOR RUNNING ASYNCHRONOUS SEARCH ALGORITHMS ON HOPS

[Figure: as before; workers also fetch new tasks asynchronously from the driver's task queue via an RPC framework]

SLIDE 56

ENTER MAGGY: A FRAMEWORK FOR RUNNING ASYNCHRONOUS SEARCH ALGORITHMS ON HOPS

◮ Robust against stragglers
◮ Supports early stopping
◮ Fault tolerance with checkpointing
◮ Monitoring with Tensorboard
◮ Log aggregation with HopsFS
◮ Simple API and extendable

[Figure: one Spark task per worker with many async tasks inside; checkpoints & results are written to HopsFS; new tasks are fetched asynchronously from the driver's task queue via an RPC framework]

SLIDE 57

MAGGY: ASYNCHRONOUS SEARCH WORKFLOW

[Figure: a coordinator holds a global task queue and a black-box optimizer (min_x f(x), x ∈ S); it sends suggested tasks to workers, each of which evaluates a trial (hyperparameters λ producing metric α) and returns results; a progress plot tracks accuracy over epochs for trials with different lr/layers settings]

SLIDE 58

MAGGY: ASYNCHRONOUS SEARCH WORKFLOW

[Figure: as on the previous slide; in addition, workers send heartbeats to the coordinator]

SLIDE 59

MAGGY: ASYNCHRONOUS SEARCH WORKFLOW

[Figure: as before; based on the heartbeats, the coordinator can send early-stop signals to workers]

SLIDE 60

MAGGY: ASYNCHRONOUS SEARCH WORKFLOW

[Figure: as before; workers additionally write checkpoints]
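Collapsed to a single-threaded simulation, the coordinator protocol on these slides amounts to the loop below: keep a queue of suggested trials, let workers evaluate them, early-stop hopeless ones, and aggregate the rest. This is an illustrative sketch, not Maggy's actual implementation; in Maggy the workers run in parallel and report via heartbeats/RPC:

import random
from collections import deque

class ToyOptimizer:
    # Stand-in for a Maggy optimizer (illustration only).
    def get_suggestion(self):
        return {"lr": 10 ** random.uniform(-4, -1)}
    def early_check(self, metric, finished):
        # Stop trials clearly worse than the best seen so far.
        best = max((m for _, m in finished), default=float("-inf"))
        return metric < best - 0.5

def run_trial(trial):
    # Stand-in for training; returns a dummy metric.
    return 1.0 - abs(trial["lr"] - 0.01)

def coordinator(optimizer, num_trials=10):
    queue = deque(optimizer.get_suggestion() for _ in range(num_trials))
    finished = []
    while queue:
        trial = queue.popleft()           # a worker fetches a suggestion
        metric = run_trial(trial)         # progress would arrive as heartbeats
        if optimizer.early_check(metric, finished):
            continue                      # early-stopped trial
        finished.append((trial, metric))  # aggregate the result
    return max(finished, key=lambda r: r[1])

print(coordinator(ToyOptimizer()))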

SLIDE 61

MAGGY: API

class RandomSearch(AbstractOptimizer):

    def initialize(self):
        # ..

    def get_suggestion(self, trial=None):
        # ..

    def finalize_experiment(self, trials):
        # ..

    def early_check(self, to_check, trials, direction):
        # ..

SLIDE 62

MAGGY: API

Users have to extend the AbstractOptimizer base class to implement their own algorithms.

class RandomSearch(AbstractOptimizer):

    def initialize(self):
        # ..

    def get_suggestion(self, trial=None):
        # ..

    def finalize_experiment(self, trials):
        # ..

    def early_check(self, to_check, trials, direction):
        # ..

SLIDE 63

MAGGY: API

Users have to extend the AbstractOptimizer base class to implement their own algorithms.
◮ initialize: initialize the search space

class RandomSearch(AbstractOptimizer):

    def initialize(self):
        # ..

    def get_suggestion(self, trial=None):
        # ..

    def finalize_experiment(self, trials):
        # ..

    def early_check(self, to_check, trials, direction):
        # ..

SLIDE 64

MAGGY: API

Users have to extend the AbstractOptimizer base class to implement their own algorithms.
◮ initialize: initialize the search space
◮ get_suggestion: suggestions to be evaluated by workers

class RandomSearch(AbstractOptimizer):

    def initialize(self):
        # ..

    def get_suggestion(self, trial=None):
        # ..

    def finalize_experiment(self, trials):
        # ..

    def early_check(self, to_check, trials, direction):
        # ..

SLIDE 65

MAGGY: API

Users have to extend the AbstractOptimizer base class to implement their own algorithms.
◮ initialize: initialize the search space
◮ get_suggestion: suggestions to be evaluated by workers
◮ finalize_experiment: aggregate results

class RandomSearch(AbstractOptimizer):

    def initialize(self):
        # ..

    def get_suggestion(self, trial=None):
        # ..

    def finalize_experiment(self, trials):
        # ..

    def early_check(self, to_check, trials, direction):
        # ..

SLIDE 66

MAGGY: API

Users have to extend the AbstractOptimizer base class to implement their own algorithms.
◮ initialize: initialize the search space
◮ get_suggestion: suggestions to be evaluated by workers
◮ finalize_experiment: aggregate results
◮ early_check: configure the early-stop policy

class RandomSearch(AbstractOptimizer):

    def initialize(self):
        # ..

    def get_suggestion(self, trial=None):
        # ..

    def finalize_experiment(self, trials):
        # ..

    def early_check(self, to_check, trials, direction):
        # ..
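As an illustration, the skeleton could be filled in for pure random search roughly as follows. The attributes used here (self.searchspace exposing an items() view of (low, high) bounds, and self.num_trials) are assumptions about the surrounding framework, not Maggy's documented internals:

import random

class RandomSearch(AbstractOptimizer):

    def initialize(self):
        # Fix the trial budget up front.
        self.trials_left = self.num_trials

    def get_suggestion(self, trial=None):
        # Called when a worker is free: return the next random
        # point, or None once the budget is exhausted.
        if self.trials_left == 0:
            return None
        self.trials_left -= 1
        return {name: random.uniform(low, high)
                for name, (low, high) in self.searchspace.items()}

    def finalize_experiment(self, trials):
        # Random search needs no post-processing.
        return trials

    def early_check(self, to_check, trials, direction):
        # Pure random search never stops a trial early.
        return False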

SLIDE 67

MAGGY: API

from maggy import experiment
from maggy.searchspace import Searchspace
from maggy.randomsearch import RandomSearch

sp = Searchspace(argument_param=('DOUBLE', [1, 5]))
rs = RandomSearch(5, sp)
result = experiment.launch(train_fn, sp, optimizer=rs,
                           num_trials=5, name='test',
                           direction="max")
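For completeness, a sketch of a train_fn that experiment.launch could drive. It assumes Maggy passes the sampled hyperparameters as keyword arguments matching the Searchspace and treats the function's return value as the metric to optimize; check the Maggy docs for the exact contract:

def train_fn(argument_param):
    # argument_param is a DOUBLE sampled from [1, 5] above.
    # A real function would train a model here; we return a toy
    # metric (maximized at argument_param == 3) so the sketch
    # is self-contained.
    return -((argument_param - 3.0) ** 2)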

SLIDE 68

SUMMARY

◮ Deep Learning is going distributed
◮ Algorithms for DDL are available in several frameworks
◮ Applying DDL in practice brings a lot of operational complexity
◮ Hopsworks is a platform for scale-out deep learning and big data processing
◮ Hopsworks makes DDL simpler by providing simple abstractions for distributed training, parallel experiments, and much more

@hopshadoop, www.hops.io; @logicalclocks, www.logicalclocks.com
We are open source: https://github.com/logicalclocks/hopsworks and https://github.com/hopshadoop/hops

Thanks to Logical Clocks Team: Jim Dowling, Seif Haridi, Theo Kakantousis, Fabio Buso, Gautier Berthou, Ermias Gebremeskel, Mahmoud Ismail, Salman Niazi, Antonios Kouzoupis, Robin Andersson, Alex Ormenisan, and Rasmus Toivonen. And our interns: Moritz Meister and Sina Sheikholeslami.

SLIDE 69

REFERENCES

◮ Example notebooks: https://github.com/logicalclocks/hops-examples
◮ HopsML8
◮ Hopsworks9
◮ Hopsworks' feature store10
◮ Maggy: https://github.com/logicalclocks/maggy

8. Logical Clocks AB. HopsML: Python-First ML Pipelines. https://hops.readthedocs.io/en/latest/hopsml/hopsML.html. 2018.
9. Jim Dowling. Introducing Hopsworks. https://www.logicalclocks.com/introducing-hopsworks/. 2018.
10. Kim Hammar and Jim Dowling. Feature Store: the missing data layer in ML pipelines? https://www.logicalclocks.com/feature-store/. 2018.