Human Centric Machine Learning Infrastructure - PowerPoint PPT Presentation



SLIDE 1

Human Centric Machine Learning Infrastructure @ Netflix

Ville Tuulos, QCon SF, November 2018

SLIDE 2

SLIDE 3

Meet Alex, a new chief data scientist at Caveman Cupcakes

You are hired!

SLIDE 4

We need a dynamic pricing model.

SLIDE 5

We need a dynamic pricing model.

Optimal pricing model

SLIDE 6

SLIDE 7

SLIDE 8

Great job! The model works perfectly!

SLIDE 9

Could you predict churn too?

SLIDE 10

Optimal pricing model Optimal churn model Alex's model

SLIDE 11

Optimal pricing model Optimal churn model Alex's model

SLIDE 12

Good job again! Promising results!

SLIDE 13

Can you include a causal attribution model for marketing?

SLIDE 14

Optimal pricing model Optimal churn model Alex's model Attribution model

SLIDE 15

Are you sure these results make sense?

SLIDE 16

Take two

SLIDE 17

Meet the new data science team at Caveman Cupcakes

You are hired!

SLIDE 18

Pricing model Churn model Attribution model

SLIDE 19

VS

SLIDE 20

the human is the bottleneck

VS

SLIDE 21

the human is the bottleneck

VS

SLIDE 22

SLIDE 23

Build

SLIDE 24

Data Warehouse

Build

SLIDE 25

Data Warehouse, Compute Resources

Build

SLIDE 26

Data Warehouse, Compute Resources, Job Scheduler

Build

SLIDE 27

Data Warehouse, Compute Resources, Job Scheduler, Versioning

Build

SLIDE 28

Data Warehouse, Compute Resources, Job Scheduler, Versioning, Collaboration Tools

Build

SLIDE 29

Data Warehouse, Compute Resources, Job Scheduler, Versioning, Collaboration Tools, Model Deployment

Build

SLIDE 30

Data Warehouse, Compute Resources, Job Scheduler, Versioning, Collaboration Tools, Model Deployment, Feature Engineering

Build

SLIDE 31

Data Warehouse, Compute Resources, Job Scheduler, Versioning, Collaboration Tools, Model Deployment, Feature Engineering, ML Libraries

Build

SLIDE 32

Data Warehouse, Compute Resources, Job Scheduler, Versioning, Collaboration Tools, Model Deployment, Feature Engineering, ML Libraries

How much the data scientist cares

Build

SLIDE 33

Data Warehouse, Compute Resources, Job Scheduler, Versioning, Collaboration Tools, Model Deployment, Feature Engineering, ML Libraries

How much the data scientist cares vs. how much infrastructure is needed

Build

SLIDE 34

Deploy

SLIDE 35

Deploy

No plan survives contact with the enemy

SLIDE 36

Deploy

No plan survives contact with the enemy

No model survives contact with reality

SLIDE 37

Our ML infra supports two human activities: building and deploying data science workflows.

SLIDE 38

SLIDE 39

Screenplay Analysis Using NLP, Fraud Detection, Title Portfolio Optimization, Estimate Word-of-Mouth Effects, Incremental Impact of Marketing, Classify Support Tickets, Predict Quality of Network, Content Valuation, Cluster Tweets, Intelligent Infrastructure, Machine Translation, Optimal CDN Caching, Predict Churn, Content Tagging, Optimize Production Schedules

SLIDE 40

Notebooks: Nteract
Job Scheduler: Meson
Compute Resources: Titus
Query Engine: Spark
Data Lake: S3
ML Libraries: R, XGBoost, TF, etc.

SLIDE 41

Notebooks: Nteract
Job Scheduler: Meson
Compute Resources: Titus
Query Engine: Spark
Data Lake: S3
ML Libraries: R, XGBoost, TF, etc.

SLIDE 42

Notebooks: Nteract (prototyping)
Job Scheduler: Meson (compute)
Compute Resources: Titus (compute)
Query Engine: Spark (data)
Data Lake: S3 (data)
ML Libraries: R, XGBoost, TF, etc. (models)

SLIDE 43

Bad Old Days

SLIDE 44

Bad Old Days

Data Scientist built an NLP model in Python. Easy and fun!

SLIDE 45

Bad Old Days

Data Scientist built an NLP model in Python. Easy and fun!

How to run at scale? Custom Titus executor.

SLIDE 46

Bad Old Days

Data Scientist built an NLP model in Python. Easy and fun!

How to run at scale? Custom Titus executor.
How to access data at scale? Slow!

SLIDE 47

Bad Old Days

Data Scientist built an NLP model in Python. Easy and fun!

How to run at scale? Custom Titus executor.
How to schedule the model to update daily? Learn about the job scheduler.
How to access data at scale? Slow!

SLIDE 48

Bad Old Days

Data Scientist built an NLP model in Python. Easy and fun!

How to run at scale? Custom Titus executor.
How to schedule the model to update daily? Learn about the job scheduler.
How to access data at scale? Slow!
How to expose the model to a custom UI? Custom web backend.

SLIDE 49

Bad Old Days

Data Scientist built an NLP model in Python. Easy and fun!

How to run at scale? Custom Titus executor.
How to schedule the model to update daily? Learn about the job scheduler.
How to access data at scale? Slow!
How to expose the model to a custom UI? Custom web backend.

Time to production: 4 months

SLIDE 50

Bad Old Days

Data Scientist built an NLP model in Python. Easy and fun!

How to run at scale? Custom Titus executor.
How to schedule the model to update daily? Learn about the job scheduler.
How to access data at scale? Slow!
How to expose the model to a custom UI? Custom web backend.

Time to production: 4 months

How to monitor models in production?
How to iterate on a new version without breaking the production version?
How to let another data scientist iterate on her version of the model safely?
How to debug yesterday's failed production run?
How to backfill historical data?
How to make this faster?

SLIDE 51

Notebooks: Nteract (prototyping)
Job Scheduler: Meson (compute)
Compute Resources: Titus (compute)
Query Engine: Spark (data)
Data Lake: S3 (data)
ML Libraries: R, XGBoost, TF, etc. (models)
ML Wrapping: Metaflow

SLIDE 52

Metaflow

SLIDE 53

Build

SLIDE 54

def compute(input):
    output = my_model(input)
    return output

input → compute → output

How to get started?

SLIDE 55

def compute(input):
    output = my_model(input)
    return output

input → compute → output

How to get started?

# python myscript.py

SLIDE 56

from metaflow import FlowSpec, step

class MyFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.a, self.b)

    @step
    def a(self):
        self.next(self.join)

    @step
    def b(self):
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):  # end step; shown in the diagram but cut off in the transcript
        pass

MyFlow()

How to structure my code?

start → (A, B) → join → end

SLIDE 57

from metaflow import FlowSpec, step

class MyFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.a, self.b)

    @step
    def a(self):
        self.next(self.join)

    @step
    def b(self):
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):  # end step; shown in the diagram but cut off in the transcript
        pass

MyFlow()

How to structure my code?

start → (A, B) → join → end

# python myscript.py run

SLIDE 58

metaflow("MyFlow") %>%
  step(step = "start",
       next_step = c("a", "b")) %>%
  step(step = "A",
       r_function = r_function(a_func),
       next_step = "join") %>%
  step(step = "B",
       r_function = r_function(b_func),
       next_step = "join") %>%
  step(step = "Join",
       r_function = r_function(join, join_step = TRUE),

How to deal with models written in R?

start → (A, B) → join → end

SLIDE 59

metaflow("MyFlow") %>%
  step(step = "start",
       next_step = c("a", "b")) %>%
  step(step = "A",
       r_function = r_function(a_func),
       next_step = "join") %>%
  step(step = "B",
       r_function = r_function(b_func),
       next_step = "join") %>%
  step(step = "Join",
       r_function = r_function(join, join_step = TRUE),

How to deal with models written in R?

start → (A, B) → join → end

# Rscript myscript.R

SLIDE 60

Metaflow adoption at Netflix

134 projects on Metaflow

as of November 2018

SLIDE 61

@step
def start(self):
    self.x = 0
    self.next(self.a, self.b)

@step
def a(self):
    self.x += 2
    self.next(self.join)

@step
def b(self):
    self.x += 3
    self.next(self.join)

@step
def join(self, inputs):
    self.out = max(i.x for i in inputs)
    self.next(self.end)

How to prototype and test my code locally?

start → (A, B) → join → end

x=0, x+=2, x+=3, max(A.x, B.x)

SLIDE 62

@step
def start(self):
    self.x = 0
    self.next(self.a, self.b)

@step
def a(self):
    self.x += 2
    self.next(self.join)

@step
def b(self):
    self.x += 3
    self.next(self.join)

@step
def join(self, inputs):
    self.out = max(i.x for i in inputs)
    self.next(self.end)

How to prototype and test my code locally?

start → (A, B) → join → end

# python myscript.py resume B

x=0, x+=2, x+=3, max(A.x, B.x)
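The branch-and-join artifact semantics on this slide can be sketched in plain Python, with no Metaflow dependency. The step bodies and the `max` reduction are taken from the slide; the dict-based state and `run` driver are an illustrative stand-in for Metaflow's artifact machinery:

```python
# Plain-Python sketch of the slide's branch/join artifact flow.
# Each branch works on its own copy of the state produced by `start`;
# `join` reduces over the branch copies, as Metaflow does with artifacts.
import copy

def start(state):
    state["x"] = 0

def a(state):
    state["x"] += 2

def b(state):
    state["x"] += 3

def join(inputs):
    return max(s["x"] for s in inputs)

def run():
    state = {}
    start(state)
    # fan out: each branch gets an isolated copy of the state
    sa, sb = copy.deepcopy(state), copy.deepcopy(state)
    a(sa)
    b(sb)
    return join([sa, sb])

print(run())  # 3  (max of A's x=2 and B's x=3)
```

The isolated copies are the point: A's `x += 2` never sees B's `x += 3`, which is why the join step has to merge the two values explicitly.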

SLIDE 63

@titus(cpu=16, gpu=1)
@step
def a(self):
    tensorflow.train()
    self.next(self.join)

@titus(memory=200000)
@step
def b(self):
    massive_dataframe_operation()
    self.next(self.join)

How to get access to more CPUs, GPUs, or memory?

start → (A, B) → join → end

A: 16 cores, 1 GPU; B: 200GB RAM
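`@titus` is Netflix-internal, so here is a hypothetical sketch of how a resource-requesting decorator like it might work: it merely attaches the declared requirements to the step as metadata, which a scheduler would later read to place the step on a matching instance. The decorator name and parameters mirror the slide; the implementation and defaults are invented:

```python
# Hypothetical sketch of a @titus-style resource decorator: it does not
# allocate anything itself, it only annotates the step with its needs.
import functools

def titus(cpu=1, gpu=0, memory=4000, network=None):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        # a scheduler would inspect this attribute at submission time
        wrapper.resources = {"cpu": cpu, "gpu": gpu,
                             "memory": memory, "network": network}
        return wrapper
    return decorator

@titus(cpu=16, gpu=1)
def train_step():
    return "trained"

print(train_step.resources["cpu"])  # 16
```

The appeal of this pattern is that the step body stays plain Python; only the annotation changes when the step needs a bigger machine.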

SLIDE 64

@titus(cpu=16, gpu=1)
@step
def a(self):
    tensorflow.train()
    self.next(self.join)

@titus(memory=200000)
@step
def b(self):
    massive_dataframe_operation()
    self.next(self.join)

How to get access to more CPUs, GPUs, or memory?

start → (A, B) → join → end

A: 16 cores, 1 GPU; B: 200GB RAM

# python myscript.py run

SLIDE 65

@step
def start(self):
    self.grid = ['x', 'y', 'z']
    self.next(self.a, foreach='grid')

@titus(memory=10000)
@step
def a(self):
    self.x = ord(self.input)
    self.next(self.join)

@step
def join(self, inputs):
    self.out = max(i.x for i in inputs)
    self.next(self.end)

How to distribute work over many parallel jobs?

start → A (foreach) → join → end
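The foreach fan-out above can be sketched in plain Python with a thread pool standing in for Metaflow's parallel jobs; the grid, the `ord` step, and the `max` join come from the slide:

```python
# Plain-Python sketch of the slide's foreach: one task per grid item,
# then a join that reduces over the branch results.
from concurrent.futures import ThreadPoolExecutor

def a(item):
    # each parallel task receives one grid element (self.input on the slide)
    return ord(item)

def run():
    grid = ['x', 'y', 'z']
    with ThreadPoolExecutor() as pool:
        xs = list(pool.map(a, grid))  # fan out: one job per item
    return max(xs)                    # join: reduce over the branches

print(run())  # 122  (ord('x')=120, ord('y')=121, ord('z')=122)
```

In Metaflow each foreach task would run as a separate (possibly remote) job; the thread pool is only a local stand-in for that scheduling.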

SLIDE 66

40% of projects run steps outside their dev environment.

How quickly do they start using Titus?

SLIDE 67

from metaflow import Table

@titus(memory=200000, network=20000)
@step
def b(self):
    # Load data from S3 into a dataframe at 10 Gbps
    df = Table('vtuulos', 'input_table')
    self.next(self.end)

How to access large amounts of input data?

start → (A, B) → join → end; B reads from S3

SLIDE 68

Case Study: Marketing Cost per Incremental Watcher

1. Build a separate model for every new title with marketing spend. Parallel foreach.
2. Load and prepare input data for each model. Download Parquet directly from S3. Total amount of model input data: 890GB.
3. Fit a model. Train each model on an instance with 400GB of RAM and 16 cores. The model is written in R.
4. Share updated results. Collect the results of the individual models and write them to a table. Results are shown on a Tableau dashboard.
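The shape of step 1 (one independent model per title, fitted in parallel, with results collected into a single table) can be sketched as follows. The titles, spend figures, and the mean-based "model" are all invented placeholders for the real per-title R models:

```python
# Sketch of the case study's per-title foreach: fit one independent
# model per title in parallel, then collect results into one mapping.
from concurrent.futures import ThreadPoolExecutor

def fit(title, spend):
    # stand-in for the real per-title model: just a mean of the spend
    return title, sum(spend) / len(spend)

spend_by_title = {               # invented example data
    'title_a': [1.0, 2.0, 3.0],
    'title_b': [4.0, 6.0],
}

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(fit, t, s) for t, s in spend_by_title.items()]
    results = dict(f.result() for f in futures)

print(results)  # {'title_a': 2.0, 'title_b': 5.0}
```

Because each title's model is independent, the fan-out scales to as many titles as there are instances available; the join step only has to gather scalars.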

SLIDE 69

Deploy

SLIDE 70

# Access Savin's runs
namespace('user:savin')
run = Flow('MyFlow').latest_run
print(run.id)    # = 234
print(run.tags)  # = ['unsampled_model']

# Access David's runs
namespace('user:david')
run = Flow('MyFlow').latest_run
print(run.id)    # = 184
print(run.tags)  # = ['sampled_model']

# Access everyone's runs
namespace(None)
run = Flow('MyFlow').latest_run
print(run.id)    # = 184

How to version my results and access results by others?

start → (A, B) → join → end
david: sampled_model, savin: unsampled_model
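The namespace behaviour on the slide can be mimicked with a small in-memory registry. The run ids, users, and tags come from the slide; the registry itself is a hypothetical stand-in for Metaflow's metadata service, and it assumes David's run is the more recent one (which is what the slide's `namespace(None)` result implies):

```python
# Toy registry mimicking namespace()/latest_run semantics: a namespace
# filters which users' runs are visible; namespace(None) sees everyone's.
_runs = [  # (run_id, user, tags), oldest first -- values from the slide
    (234, 'savin', ['unsampled_model']),
    (184, 'david', ['sampled_model']),
]
_current = None

def namespace(ns):
    global _current
    _current = ns

def latest_run():
    visible = [r for r in _runs
               if _current is None or _current == 'user:' + r[1]]
    return visible[-1]  # most recent visible run

namespace('user:savin')
print(latest_run()[0])  # 234
namespace(None)
print(latest_run()[0])  # 184 -- the newest run across all users
```

The design point the slide is making: every run is versioned and addressable, so one scientist's production results and another's experiments coexist without clobbering each other.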

SLIDE 71

How to deploy my workflow to production?

start → (A, B) → join → end

SLIDE 72

How to deploy my workflow to production?

start → (A, B) → join → end

# python myscript.py meson create

SLIDE 73

26% of projects get deployed to the production scheduler.

How quickly does the first deployment happen?

SLIDE 74

How to monitor models and examine results?

start → (A, B) → join → end
x=0, x+=2, x+=3, max(A.x, B.x)

SLIDE 75

How to deploy results as a microservice?

start → (A, B) → join → end
x=0, x+=2, x+=3, max(A.x, B.x)

Metaflow Hosting

from metaflow import WebServiceSpec
from metaflow import endpoint

class MyWebService(WebServiceSpec):

    @endpoint
    def show_data(self, request_dict):
        # TODO: real-time predict here
        result = self.artifacts.flow.x
        return {'result': result}

SLIDE 76

How to deploy results as a microservice?

start → (A, B) → join → end
x=0, x+=2, x+=3, max(A.x, B.x)

Metaflow Hosting

from metaflow import WebServiceSpec
from metaflow import endpoint

class MyWebService(WebServiceSpec):

    @endpoint
    def show_data(self, request_dict):
        # TODO: real-time predict here
        result = self.artifacts.flow.x
        return {'result': result}

# curl http://host/show_data
{"result": 3}

SLIDE 77

Case Study: Launch Date Schedule Optimization

1. Batch-optimize launch date schedules for new titles daily. Batch optimization deployed on Meson.
2. Serve results through a custom UI. Results deployed on Metaflow Hosting.
3. Support arbitrary what-if scenarios in the custom UI. Run the optimizer in real time in a custom web endpoint.

SLIDE 78

Metaflow

SLIDE 79

diverse problems

SLIDE 80

diverse problems, diverse people

SLIDE 81

diverse problems, diverse people, diverse models

SLIDE 82

diverse problems, diverse people: help people build diverse models

SLIDE 83

diverse problems, diverse people: help people build and deploy diverse models

SLIDE 84

diverse problems, diverse people: help people build and deploy diverse models → happy people, healthy business

SLIDE 85

thank you!

@vtuulos
vtuulos@netflix.com

SLIDE 86

Photo Credits
Bruno Coldiori: https://www.flickr.com/photos/br1dotcom/8900102170/
https://www.maxpixel.net/Isolated-Animal-Hundeportrait-Dog-Nature-3234285