ML Infra at an Early-Stage Feature Service Spencer Barton, Data - - PowerPoint PPT Presentation



SLIDE 1

ML Infra at an Early-Stage
Feature Service

Spencer Barton, Data Scientist
April 2019

SLIDE 2

Branch in the Numbers

SLIDE 3

SLIDE 4

Our mission is to deliver world-class financial services to the mobile generation.

SLIDE 5

From Install to Approval in Minutes

1. ANSWER 3 QUESTIONS TO REGISTER
   KYC checks with external APIs, mobile data mined and analysed.

2. ELIGIBLE LOAN OFFERS ARE DISPLAYED
   Credit score calculated in seconds.

3. DEPOSIT TO BANK ACCOUNT OR MOBILE WALLET
   Repayment schedule set and monitored.

SLIDE 6

How Branch works behind the scenes

Collect Phone Data -> Generate Features -> Credit Model

We collect

  • Text messages
  • Installed apps
  • Contact lists
  • In-app events

We extract

  • Bank balance
  • Number of contacts
  • Read the FAQ
  • Installed Facebook app

We predict probability of repayment

SLIDE 7

How do I build ML into my product?

SLIDE 8

Big Firms Can Build Custom ML Infrastructure

[Diagram labels: 5 data scientists; 5 engineers; 5 engineers; 10 engineers; 10 engineers; 2 product managers]

Source: Bighead - Airbnb’s End-to-End Machine Learning Platform

SLIDE 9

Can the rest of us do machine learning?

We too can build infrastructure but must be strategic.

Build a Feature Service!

SLIDE 10

What does a feature service do for me?

  • Faster development of new features
  • Reduce bugs with consistent feature definitions
  • Speed up slow feature calculations
  • Easy feature discovery and sharing

SLIDE 11

Where do you start?

SLIDE 12

You want to start basic

https://en.wikipedia.org/wiki/Linear_regression

SLIDE 13

You will gradually mature your ML

https://towardsdatascience.com/polynomial-regression-bbe8b9d97491

SLIDE 14

The basics will only get you so far

SLIDE 15

What do you focus on beyond the basics?

Gather Data -> Build Features -> Train Model -> Serve Model

SLIDE 16

We needed to improve our features

Our data sources were in OK shape, but

  • Differences in features between dev, training and production led to bugs
  • Inconsistent feature definitions led to bugs
  • Feature creation was a training bottleneck

SLIDE 17

We decided to build a Feature Service

We invested in infrastructure to improve features.

SLIDE 18

What is a Feature Service?

A Feature Service computes a feature vector for a specific object at a specific time.

Get features for user 90234 -> Feature Service -> Feature vector for user 90234

  {
    "average_bank_balance": 324090,
    "number_referrals": 15,
    "read_faq": true
  }
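
A minimal sketch of that definition, assuming a hypothetical in-memory `RAW_USERS` table and a hypothetical `FEATURES` registry (neither name comes from the talk). Each feature is a named function over a user's raw record, and the feature vector is the result of running all of them for one user. The raw values are invented so that the result matches the numbers on the slide.

```python
from typing import Any, Callable, Dict

# Hypothetical raw records keyed by user id; a real service would read
# these from its raw data sources instead.
RAW_USERS: Dict[int, Dict[str, Any]] = {
    90234: {"bank_balances": [300000, 348180], "referrals": 15, "faq_views": 2},
}

# Hypothetical registry: each feature is a named function from a raw
# record to a single numeric or boolean value.
FEATURES: Dict[str, Callable[[Dict[str, Any]], Any]] = {
    "average_bank_balance": lambda r: sum(r["bank_balances"]) / len(r["bank_balances"]),
    "number_referrals": lambda r: r["referrals"],
    "read_faq": lambda r: r["faq_views"] > 0,
}


def get_feature_vector(user_id: int) -> Dict[str, Any]:
    """Compute every registered feature for one user."""
    record = RAW_USERS[user_id]
    return {name: compute(record) for name, compute in FEATURES.items()}


vector = get_feature_vector(90234)
print(vector)
```

Keeping the definitions in one registry is what lets development, training and inference ask for the same feature by name.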

SLIDE 19

Features are computed relative to a timestamp

Get features for user 90234 on 2016-10-2 -> Feature Service -> Feature vector for user 90234 on 2016-10-2

  {
    "average_bank_balance": 504090,
    "number_referrals": 0,
    "read_faq": false
  }
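
One way to read "relative to a timestamp" is that a feature may only use data observed on or before the requested date. The sketch below assumes that interpretation. The `BALANCE_EVENTS` list and the `as_of` parameter are hypothetical, and the balances are invented so that the two results match the values shown on the slides.

```python
from datetime import date
from typing import List, Tuple

# Hypothetical time-stamped balance observations for one user:
# (observation date, balance).
BALANCE_EVENTS: List[Tuple[date, int]] = [
    (date(2016, 9, 1), 500000),
    (date(2016, 10, 1), 508180),
    (date(2017, 3, 1), 100000),
    (date(2017, 6, 1), 188180),
]


def average_bank_balance(events: List[Tuple[date, int]], as_of: date) -> float:
    """Average only the balances observed on or before `as_of`."""
    visible = [balance for day, balance in events if day <= as_of]
    if not visible:
        raise ValueError(f"no balance data on or before {as_of}")
    return sum(visible) / len(visible)


historical = average_bank_balance(BALANCE_EVENTS, date(2016, 10, 2))
latest = average_bank_balance(BALANCE_EVENTS, date(2019, 4, 1))
print(historical, latest)  # 504090.0 324090.0
```

Filtering by date before aggregating keeps later data out of historical training rows.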

SLIDE 20

Features are accessed by a simple API

GET feature/bank_balance/v0_1?pid=12314
GET feature/bank_balance/v0_3?pid=1214&date=2017-12-3
GET feature/loan_repayment/v0_1?pid=3531

  • bank_balance, loan_repayment: feature name
  • v0_1, v0_3: feature version
  • pid: primary id, like user id
  • date: date for historical features
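
A minimal sketch of how a request in this shape could be taken apart, using only the Python standard library. The `FeatureRequest` type and the `parse_feature_request` function are hypothetical names, not the talk's implementation. The date is split by hand because the slide's example (`2017-12-3`) is not zero-padded.

```python
from datetime import date
from typing import NamedTuple, Optional
from urllib.parse import parse_qs, urlsplit


class FeatureRequest(NamedTuple):
    name: str              # feature name, e.g. "bank_balance"
    version: str           # feature version, e.g. "v0_3"
    pid: int               # primary id, like user id
    as_of: Optional[date]  # only set for historical features


def parse_feature_request(url: str) -> FeatureRequest:
    """Parse 'feature/<name>/<version>?pid=<id>[&date=<Y-M-D>]'."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    if len(segments) != 3 or segments[0] != "feature":
        raise ValueError(f"unexpected path: {parts.path!r}")
    _, name, version = segments

    query = parse_qs(parts.query)
    pid = int(query["pid"][0])

    as_of = None
    if "date" in query:
        year, month, day = (int(part) for part in query["date"][0].split("-"))
        as_of = date(year, month, day)
    return FeatureRequest(name, version, pid, as_of)


req = parse_feature_request("feature/bank_balance/v0_3?pid=1214&date=2017-12-3")
print(req)
```

Putting the version in the path means a model can pin the exact feature definition it was trained on.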

SLIDE 21

Why build a custom solution?

Gather Data -> Build Features -> Train Model -> Serve Model

Feature Service

SLIDE 22

What are we building?

Feature Service

  • Server infrastructure
  • Cache infrastructure
  • A Python framework

SLIDE 23

Data source dependencies were messy

[Diagram labels: Write, Read; Inference, Training, Development; Raw Data Source A, Raw Data Source B]

SLIDE 24

We abstracted complicated data sources

[Diagram labels: Write, Read; Inference, Training, Development; Feature Service; Raw Data Source A, Raw Data Source B]

SLIDE 25

Features were being created all over the place

[Diagram labels: Write, Read; Inference, Training, Development; Raw Data Source A, Raw Data Source B]

SLIDE 26

Every step of ML shares consistent features

[Diagram labels: Write, Read; Inference, Training, Development; Feature Service]

SLIDE 27

New models were recreating features

[Diagram labels: Write, Read; Model 1, Model 2; Raw Data Source A, Raw Data Source B]

SLIDE 28

ML models now share the same features

[Diagram labels: Write, Read; Model 1, Model 2, Model 3; Feature Service]

SLIDE 29

The Feature Service server helps a lot

  • Abstracted data sources
  • Shared features
  • Consistent features

Now onto storage...

SLIDE 30

Features were computed once and forgotten

Time ->

  • Inference for user 3: Compute all features
  • Inference for user 3: Compute all the same features again
  • Inference for user 3: Compute all the same features again

SLIDE 31

We built feature storage and caching

[Diagram labels: Write, Read; Feature Service; Feature Storage; Analytics, Monitoring]
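
A minimal sketch of the compute-once idea, assuming a plain dictionary stands in for Feature Storage (the talk names AWS DynamoDB as the real store). The `CachedFeatureStore` class, its key layout and `slow_compute` are hypothetical. The point is only that a repeated request for the same feature, version, id and date is served from storage instead of being recomputed.

```python
from typing import Callable, Dict, Tuple

# Hypothetical cache key: (feature name, version, pid, date string).
CacheKey = Tuple[str, str, int, str]


class CachedFeatureStore:
    """Compute a feature on first request, then serve it from storage."""

    def __init__(self, compute: Callable[[str, str, int, str], float]) -> None:
        self._compute = compute
        self._store: Dict[CacheKey, float] = {}  # stand-in for a real database
        self.compute_calls = 0

    def get(self, name: str, version: str, pid: int, day: str) -> float:
        key = (name, version, pid, day)
        if key not in self._store:
            # Cache miss: do the expensive work once and write the result.
            self.compute_calls += 1
            self._store[key] = self._compute(name, version, pid, day)
        return self._store[key]


def slow_compute(name: str, version: str, pid: int, day: str) -> float:
    # Placeholder for an expensive pass over raw data.
    return 324090.0


store = CachedFeatureStore(slow_compute)
first = store.get("average_bank_balance", "v0_1", 3, "2019-04-01")
second = store.get("average_bank_balance", "v0_1", 3, "2019-04-01")
print(first, second, store.compute_calls)  # 324090.0 324090.0 1
```

Because the stored rows are ordinary records, the same store can also be read for analytics and monitoring.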

SLIDE 32

We sped up training with a cache

Time ->

  • Inference: Calculate and cache features in production
  • Training: Use cached features for model training
  • Model Iteration: Use cached features for model development

[Diagram labels: Write, Read; Feature Storage]

SLIDE 33

Feature storage helps too

  • Remove recomputation of features
  • Enable analytics and monitoring
  • Increase training speed

SLIDE 34

We built with simple components

  • Feature Service: Flask App Deployed on AWS Elastic Beanstalk
  • Feature Storage: AWS DynamoDB

SLIDE 35

Simple infrastructure solved many problems

[Diagram labels: Write, Read; Inference, Training, Development; Feature Service; Feature Storage; Raw Data Source A, Raw Data Source B]

  • Simple (Flask) App
  • Data abstraction
  • Caching
  • Analytics
  • Monitoring
  • Common source

SLIDE 36

How do we actually generate features?

[Diagram labels: Write, Read; Development; Feature Service; Raw Data Source; Text messages; Bank balance]

SLIDE 37

We built a framework

Features are composed of

  • One or more Extractors which pull data from a Raw Data Source
  • Many Transformers which convert the data into numeric or categorical features

Feature: average_bank_balance

  Extractors: Extract SMS (Raw Data Source: S3)
  Transformers: Select bank messages -> Pull out values -> Average

  "average_bank_balance": 324090
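
A minimal sketch of the Extractor and Transformer idea, not the talk's actual framework. Every class name, the "mentions balance" rule and the sample messages are assumptions made for illustration. A dictionary stands in for the SMS data in S3, and the sketch uses a single Extractor where the slide allows one or more.

```python
import re
from statistics import mean
from typing import Any, Dict, List, Optional


class Extractor:
    """Base class: pulls raw data for one primary id from a raw data source."""

    def extract(self, pid: int) -> Any:
        raise NotImplementedError


class Transformer:
    """Base class: converts data one step closer to a feature value."""

    def transform(self, data: Any) -> Any:
        raise NotImplementedError


class SmsExtractor(Extractor):
    """Stand-in for reading a user's SMS from S3; here it reads a dict."""

    def __init__(self, messages_by_pid: Dict[int, List[str]]) -> None:
        self._messages_by_pid = messages_by_pid

    def extract(self, pid: int) -> List[str]:
        return self._messages_by_pid.get(pid, [])


class SelectBankMessages(Transformer):
    def transform(self, data: List[str]) -> List[str]:
        # Assumption: a bank message is any message mentioning "balance".
        return [text for text in data if "balance" in text.lower()]


class PullOutValues(Transformer):
    def transform(self, data: List[str]) -> List[int]:
        # Take the first run of digits in each message as the balance.
        matches = (re.search(r"\d+", text) for text in data)
        return [int(m.group()) for m in matches if m]


class Average(Transformer):
    def transform(self, data: List[int]) -> Optional[float]:
        return mean(data) if data else None


class Feature:
    """One extractor followed by a chain of transformers."""

    def __init__(self, name: str, extractor: Extractor,
                 transformers: List[Transformer]) -> None:
        self.name = name
        self.extractor = extractor
        self.transformers = transformers

    def compute(self, pid: int) -> Any:
        data = self.extractor.extract(pid)
        for transformer in self.transformers:
            data = transformer.transform(data)
        return data


SMS = {90234: ["Your balance is 300000", "Lunch at noon?",
               "Balance: 348180 after deposit"]}
average_bank_balance = Feature(
    "average_bank_balance",
    SmsExtractor(SMS),
    [SelectBankMessages(), PullOutValues(), Average()],
)
value = average_bank_balance.compute(90234)
print({average_bank_balance.name: value})
```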

SLIDE 38

Extractors and Transformers are shared

Feature: average_bank_balance

  Extractors: Extract SMS (Raw Data Source: S3)
  Transformers: Select bank messages -> Pull out values -> Average

  "average_bank_balance": 324090

Feature: maximum_bank_balance

  Extractors: Extract SMS
  Transformers: Select bank messages -> Pull out values -> Maximum

  "maximum_bank_balance": 500034
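
The sharing shown here can be sketched with plain functions: both features run the same leading steps and differ only in the final aggregation. The `SHARED_STEPS` tuple, the `make_feature` helper and the message format are hypothetical, and the balances are chosen so that the results match the slide.

```python
from typing import Any, Callable, Dict, List

Message = Dict[str, Any]


def select_bank_messages(messages: List[Message]) -> List[Message]:
    # Assumption: bank messages are identified by their sender.
    return [m for m in messages if m["sender"] == "BANK"]


def pull_out_values(messages: List[Message]) -> List[int]:
    return [m["balance"] for m in messages]


# Steps shared by every bank-balance feature.
SHARED_STEPS = (select_bank_messages, pull_out_values)


def make_feature(aggregate: Callable[[List[int]], float]) -> Callable[[List[Message]], float]:
    """Build a feature from the shared steps plus one final aggregation."""
    def compute(messages: List[Message]) -> float:
        data: Any = messages
        for step in SHARED_STEPS:
            data = step(data)
        return aggregate(data)
    return compute


average_bank_balance = make_feature(lambda values: sum(values) / len(values))
maximum_bank_balance = make_feature(max)

messages = [
    {"sender": "BANK", "balance": 148146},
    {"sender": "BANK", "balance": 500034},
    {"sender": "FRIEND", "balance": None},
]
print(average_bank_balance(messages))  # 324090.0
print(maximum_bank_balance(messages))  # 500034
```

A fix to a shared step, such as how bank messages are selected, then reaches every feature built on it.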

SLIDE 39

Framework example

  • Everything is built on base classes with automated testing
  • As flexible as Python
  • Custom one-off transforms
  • Features are built on versioned extracts and transforms
  • Chain of transformations

SLIDE 40

Feature versions support new models

  • Old Credit Model: Buggy feature bank_balance:v1
  • New Credit Model: Bug fixed: bank_balance:v2

[Diagram labels: Write, Read; Feature Service; Flask App]
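
A minimal sketch of versioned features, assuming a hypothetical `REGISTRY` keyed by feature name and version. The talk does not say what the `bank_balance:v1` bug was, so the integer-division bug below is invented purely for illustration. What matters is that both versions stay addressable: the old model keeps reading `v1` unchanged while the new model reads `v2`.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry keyed by (feature name, version).
REGISTRY: Dict[Tuple[str, str], Callable[[List[int]], float]] = {}


def register(name: str, version: str):
    """Decorator that files a feature function under (name, version)."""
    def decorator(fn: Callable[[List[int]], float]):
        key = (name, version)
        if key in REGISTRY:
            # A published version is never silently redefined.
            raise ValueError(f"{name}:{version} is already registered")
        REGISTRY[key] = fn
        return fn
    return decorator


@register("bank_balance", "v1")
def bank_balance_v1(balances: List[int]) -> float:
    # Invented bug for illustration: integer division drops the fraction.
    return sum(balances) // len(balances)


@register("bank_balance", "v2")
def bank_balance_v2(balances: List[int]) -> float:
    # Bug fixed in a new version; v1 is left as the old model expects it.
    return sum(balances) / len(balances)


def get_feature(name: str, version: str, balances: List[int]) -> float:
    return REGISTRY[(name, version)](balances)


old_value = get_feature("bank_balance", "v1", [100, 201])
new_value = get_feature("bank_balance", "v2", [100, 201])
print(old_value, new_value)  # 150 150.5
```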

SLIDE 41

The framework makes development easy

  • Feature definitions are consistent
  • New features are easy to build from shared components
  • Versioning allows backwards compatibility and bug fixes

SLIDE 42

The Feature Service solves many problems

[Diagram labels: Write, Read; Inference, Training, Development; Feature Service; Feature Storage; Raw Data Source A, Raw Data Source B]

  • Simple (Flask) App
  • Data abstraction
  • Caching
  • Analytics
  • Monitoring
  • Common source

Framework:

  • Consistency
  • Easy development
  • Versioning

SLIDE 43

Should I build a Feature Service?

  • Is feature quality a problem for you?
  • Are your data sources complex and varied?
  • Do you want to support multiple models?
  • Are your features difficult to compute?

SLIDE 44

We’re benefitting from our Feature Service

  • Feature generation time reduced!
  • Fixed a lot of bugs by using the framework!
  • New models without remaking features!
  • New data scientists can contribute within a week of joining!
  • And our model performance has improved!

SLIDE 45

What should I take away?

  • You don’t have to be a big company to use ML infrastructure
  • But your resources are limited so be strategic
  • And invest in a Feature Service!
  • Stay informed because the landscape changes fast
      ○ Airbnb Bighead may be open sourced soon

SLIDE 46

The Team

  • Dennis Van Der Staay
  • Dave Bernthal
  • Ting Ting Liu
  • Nick Handel
  • Spencer Barton

SLIDE 47

Thank You!

Spencer Barton
spencer@branch.co

SLIDE 48

Appendix

SLIDE 49

Who else is talking about Feature Services?

  • Nick Handel delivering an earlier version of this presentation
  • Varant Zanoyan, Zipline at Airbnb
  • Uber’s Michelangelo