SLIDE 1

ML through Streaming at

QCON LONDON 2020

Sherin Thomas

@doodlesmt

SLIDE 2

SLIDE 3

Stopping a Phishing Attack

SLIDE 4

“Hello Alex, I’m Tracy calling from Lyft HQ. This month we’re awarding $200 to all 4.7+ star drivers. Congratulations!” “Hey Tracy, thanks!” “Np! And because we see that you’re in a ride, we’ll dispatch another driver so you can park at a safe location… …Alright, your passenger will be taken care of by another driver.”

SLIDE 5

“Before we can credit you the award, we just need to quickly verify your identity. We’ll now send you a verification text. Can you please tell us what those numbers are…” “12345”

SLIDE 6

Fingerprinting Fraudulent Behaviour

SLIDE 7

Request Ride … Driver Contact … Cancel Ride … Something

Sequence of User Actions

Reference: Fingerprinting Fraudulent Behaviour

SLIDE 8

Reference: Fingerprinting Fraudulent Behaviour

Red Flag

Request Ride … Driver Contact … Cancel Ride … Something

Sequence of User Actions

SLIDE 9

SELECT
  user_id,
  TOP(2056, action) OVER (
    PARTITION BY user_id
    ORDER BY event_time
    RANGE INTERVAL '90' DAYS PRECEDING
  ) AS client_action_sequence
FROM event_user_action

Temporally ordered user action sequence
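As a rough illustration of what the query above computes, here is a small Python sketch (the function name and event format are my own, not the talk's implementation): it builds each user's temporally ordered action sequence, with a bounded lookback and a cap on sequence length.

```python
from collections import defaultdict, deque

def build_sequences(events, now, max_actions=2056, lookback_days=90):
    """events: iterable of (user_id, action, event_time), event_time in days."""
    cutoff = now - lookback_days
    sequences = defaultdict(deque)
    # Sort by event time so each per-user sequence is temporally ordered.
    for user_id, action, event_time in sorted(events, key=lambda e: e[2]):
        if event_time < cutoff:
            continue  # outside the lookback window
        seq = sequences[user_id]
        seq.append(action)
        if len(seq) > max_actions:
            seq.popleft()  # keep only the most recent max_actions actions
    return {u: list(s) for u, s in sequences.items()}
```

The SQL version computes this incrementally per user inside Flink; the sketch only shows the window semantics (last 2056 actions within 90 days, ordered by event time).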

SLIDE 10

SELECT
  user_id,
  TOP(2056, action) OVER (
    PARTITION BY user_id
    ORDER BY event_time
    RANGE INTERVAL '90' DAYS PRECEDING
  ) AS client_action_sequence
FROM event_user_action

Last x events sorted by time

Temporally ordered user action sequence

SLIDE 11

SELECT
  user_id,
  TOP(2056, action) OVER (
    PARTITION BY user_id
    ORDER BY event_time
    RANGE INTERVAL '90' DAYS PRECEDING
  ) AS client_action_sequence
FROM event_user_action

Historic context is also important (large lookback)

Temporally ordered user action sequence

SLIDE 12

SELECT
  user_id,
  TOP(2056, action) OVER (
    PARTITION BY user_id
    ORDER BY event_time
    RANGE INTERVAL '90' DAYS PRECEDING
  ) AS client_action_sequence
FROM event_user_action

Event time processing

Temporally ordered user action sequence

SLIDE 13

Make streaming features accessible for ML use cases

SLIDE 14

SLIDE 15

Flink

SLIDE 16
  • Low latency stateful operations on streaming data - on the order of milliseconds
  • Event time processing - replayability, correctness
  • Exactly-once processing
  • Failure recovery
  • SQL API

Apache Flink

SLIDE 17

Event Ingestion Pipeline

[Diagram: event ingestion pipeline - events flow through Kinesis streams and filters into both offline/batch stores (S3, HDFS) and streaming consumers]

{ "ride_req", "user_id": 123, "event_time": t0 }

SLIDE 18

Credit: The Beam Model by Tyler Akidau and Frances Perry

Processing Time vs Event Time

Processing time
System time at which the event is processed -> determined by the processor

Event time
Logical time at which the event occurred -> part of the event metadata
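A tiny sketch of why the distinction matters (the events and numbers here are invented for illustration, not from the talk): bucketing the same events by processing time versus event time yields different window sums whenever an event arrives late.

```python
from collections import defaultdict

def window_sums(events, time_key, window_secs=120):
    """Sum event values into fixed windows keyed by the given timestamp field."""
    sums = defaultdict(int)
    for e in events:
        window_start = e[time_key] - e[time_key] % window_secs
        sums[window_start] += e["value"]
    return dict(sums)

events = [
    # This event occurred at t=10s but arrived late, at t=130s:
    {"value": 5, "event_time": 10, "processing_time": 130},
    {"value": 3, "event_time": 125, "processing_time": 126},
]
by_event = window_sums(events, "event_time")            # {0: 5, 120: 3}
by_processing = window_sums(events, "processing_time")  # {120: 8}
```

Event-time processing assigns the late event back to the window in which it logically occurred, which is what makes results replayable and correct.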
SLIDE 19

[Example: Star Wars films - event time is story order (Episodes I-IX, with Rogue One at III.5), processing time is release order (1977, 1980, 1983, 1999, 2002, 2005, 2015, 2016, 2018, 2019)]

SLIDE 20

Example: integer sum over 2 min window

Credit: The Beam Model by Tyler Akidau and Frances Perry

SLIDE 21

Watermark

[Diagram: out-of-order event times 12:09, 12:08, 12:03, 12:05, 12:04, 12:01, 12:02, with watermarks advancing W = 12:02, W = 12:05, W = 12:10]
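One common watermark strategy, sketched below in Python (an illustrative assumption on my part, not necessarily the exact strategy used in the talk), is bounded out-of-orderness: the watermark trails the maximum event time seen by a fixed delay, and a window can be finalized once the watermark passes its end. With the event times above and a 4-minute bound, seeing 12:09 yields W = 12:05.

```python
class BoundedOutOfOrdernessWatermark:
    """Watermark = max event time seen so far, minus a fixed lateness bound."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_event_time = None

    def on_event(self, event_time):
        # Watermarks only move forward; late events cannot drag them back.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return self.current()

    def current(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.max_delay

wm = BoundedOutOfOrdernessWatermark(max_delay=4)  # minutes past the hour
for t in [9, 8, 3]:  # out-of-order event times: 12:09, 12:08, 12:03
    wm.on_event(t)
print(wm.current())  # 5 -> "W = 12:05": events older than 12:05 are late
```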

SLIDE 22

Example: integer sum over 2 min window

Credit: The Beam Model by Tyler Akidau and Frances Perry

SLIDE 23

Example: integer sum over 2 min window

Credit: The Beam Model by Tyler Akidau and Frances Perry

SLIDE 24

Usability

SLIDE 25

1. Model Development
2. Feature Engineering
3. Data Quality
4. Scheduling, Execution, Data Collection
5. Compute Resources

What Data Scientists care about

SLIDE 26

Data Input -> Data Prep -> Modeling -> Deployment

Data Input: data discovery. Data Prep: normalize and clean up data, extract & transform features, label data, maintain external feature sets. Modeling: train models, evaluate and optimize. Deployment: deploy, monitor & visualize performance.

ML Workflow


SLIDE 28

User Plane: Dryft UI

Control Plane: Query Analysis, Job Cluster, Data Discovery

Data Plane: Kafka, DynamoDB, Druid, Hive, Elasticsearch

Dryft! - Self Service Streaming Framework

SLIDE 29

Declarative Job Definition

{
  "retention": {},
  "lookback": {},
  "stream": { "kinesis": "user_activities" },
  "features": {
    "user_activity_per_geohash": {
      "type": "int",
      "version": 1,
      "description": "user activities per geohash"
    }
  }
}

Job Config

SELECT
  geohash,
  COUNT(*) AS total_events,
  TUMBLE_END(rowtime, INTERVAL '1' HOUR)
FROM event_user_action
GROUP BY
  geohash,
  TUMBLE(rowtime, INTERVAL '1' HOUR)

Flink SQL
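A rough Python analogue of the Flink SQL above (illustrative only; the geohash values and timestamps are invented), counting events per geohash in one-hour tumbling windows keyed by event time:

```python
from collections import Counter

def tumbling_counts(events, window_secs=3600):
    """events: iterable of (geohash, rowtime_epoch_secs).

    Returns counts keyed by (geohash, window_end), where window_end is
    the analogue of TUMBLE_END for a fixed-size, non-overlapping window.
    """
    counts = Counter()
    for geohash, rowtime in events:
        window_end = rowtime - rowtime % window_secs + window_secs
        counts[(geohash, window_end)] += 1
    return counts
```

Each event lands in exactly one window, which is what distinguishes tumbling windows from the hopping (sliding) windows used later in the talk.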

SLIDE 30

[Diagram: sources (Kinesis, S3) feed the Dryft job; sinks include Kinesis, DynamoDB and Hive; Feature Fanout apps read the Kinesis sink and serve user apps]

SLIDE 31

Eating our own dogfood

SLIDE 32

SELECT
  -- this will be used in keyBy
  CONCAT_WS('_', feature_name, version, id),
  feature_data,
  CONCAT_WS('_', feature_name, version) AS feature_definition,
  occurred_at
FROM features

Feature Fanout App - also uses Dryft

{
  "stream": { "kinesis": "feature_stream" },
  "sink": {
    "feature_service_dynamodb": {
      "write_rate": 1000,
      "retry_count": 5
    }
  }
}

SLIDE 33

SLIDE 34

Deployment

SLIDE 35

Previously...

  • Ran on AWS EC2 using custom deployment
  • Separate autoscaling groups for JobManager and TaskManagers
  • Instance provisioning done during deployment
  • Multiple jobs (60+) running on the same cluster
SLIDE 36

Multi-tenancy hell!!

SLIDE 37

Kubernetes Based Deployment

Managing Flink on Kubernetes

[Diagram: three Flink applications (App 1, App 2, App 3) on Kubernetes, each with its own JobManager (JM) and its own set of TaskManagers (TM)]

SLIDE 38

Flink-K8s-Operator

Managing Flink on Kubernetes

[Diagram: a Custom Resource Descriptor is submitted to the Flink Operator, which manages the Flink cluster (one JM, several TMs)]

SLIDE 39

Custom Resource Descriptor

apiVersion: flink.k8s.io/v1alpha
kind: FlinkApplication
metadata:
  name: flink-speeds-working-stats
  namespace: flink
spec:
  image: '100.dkr.ecr.us-east-1.amazonaws.com/abc:xyz'
  flinkJob:
    jarName: name.jar
    parallelism: 10
  taskManagerConfig:
    resources:
      limits:
        memory: 15Gi
        cpu: 4
    replicas: num_task_managers
    taskSlots: NUM_SLOTS_PER_TASK_MANAGER
    envConfig: {...}

  • Custom resource represents a Flink application
  • Docker image contains all dependencies
  • CRD modifications trigger an update (includes parallelism and other Flink configuration properties)

SLIDE 40

[Diagram: Dryft conf -> validate -> compute resources -> generate CRD -> Kubernetes CRD -> Flink Operator -> Flink cluster (JM + TMs)]

SLIDE 41

Flink on Kubernetes

Managing Flink on Kubernetes - by Anand and Ketan

  • Separate Flink cluster for each application
  • Resource allocation customized per job - at job creation time
  • Scales to 100s of Flink applications
  • Automatic application updates
SLIDE 42

Bootstrapping

SLIDE 43

SELECT
  passenger_id,
  COUNT(ride_id)
FROM event_ride_completed
GROUP BY
  passenger_id,
  HOP(rowtime, INTERVAL '30' DAY, INTERVAL '1' HOUR)

What is bootstrapping?

SLIDE 44
[Diagram: events numbered 1-7 on a timeline around the current time - historic data to the left, future data to the right]

Read historic data to ‘bootstrap’ the program with 30 days’ worth of data. Now your program returns results on day 1. But what if the source does not have all 30 days’ worth of data?

Bootstrap with historic data

SLIDE 45

Read historic data from a persistent store (AWS S3) and streaming data from Kafka/Kinesis

Solution - Consume from two sources

Bootstrapping state in Apache Flink - Hadoop Summit

[Diagram: the historic source (S3) supplies events with event time < target time, the real-time source supplies events with event time >= target time; both feed the business logic, which writes to the sink]
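The routing rule can be sketched in a few lines of Python (field names are illustrative; the real pipeline does this inside Flink): historic events are replayed for event times before the target time, then the real-time stream takes over at the target time, so the business logic sees one continuous timeline without double-counting the overlap.

```python
def bootstrap_then_stream(historic_events, realtime_events, target_time):
    """Yield one continuous, non-overlapping stream from two sources."""
    for e in historic_events:
        if e["event_time"] < target_time:   # historic source: < target time
            yield e
    for e in realtime_events:
        if e["event_time"] >= target_time:  # real-time source: >= target time
            yield e
```

Events present in both sources near the cutover are emitted exactly once, because each side of the target time is served by exactly one source.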

SLIDE 46

SLIDE 47

Job starts

SLIDE 48

Bootstrapping

SLIDE 49

Start Job
With a higher parallelism for fast bootstrapping

Detect Bootstrap Completion
Job sends a signal to the control plane once the watermark has progressed beyond the point where we no longer need historic data

“Update” Job with lower parallelism but same job graph
Control plane cancels the job with a savepoint and starts it again from that savepoint, but with a much lower parallelism

SLIDE 50

Output volume spike during bootstrapping

Bootstrapping

SLIDE 51

Output volume spike during bootstrapping

  • Features need to be fresh but eventually complete
  • Smooth out data writes during bootstrap to match throughput
  • Write features produced during bootstrapping separately

[Diagram: features produced during bootstrap go to a low-priority Kinesis stream, steady-state features to a high-priority stream; both drain into an idempotent sink]
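One way to smooth out writes, sketched here as an assumption rather than Dryft's actual code, is a simple pacing limiter that spaces sink writes to match the sink's sustainable throughput (e.g. the "write_rate": 1000 writes/sec in the fanout job config above).

```python
import time

class WriteRateLimiter:
    """Paces sink writes to a target rate; clock/sleep are injectable for testing."""

    def __init__(self, writes_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / writes_per_sec
        self.clock = clock
        self.sleep = sleep
        self.next_slot = clock()

    def acquire(self):
        # Block until the next write slot opens, then reserve the one after it.
        now = self.clock()
        if now < self.next_slot:
            self.sleep(self.next_slot - now)
        self.next_slot = max(self.next_slot, now) + self.interval
```

Calling `acquire()` before each sink write caps the steady write rate, so a bootstrap's burst of historic output drains gradually instead of overwhelming the sink.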

SLIDE 52

What about skew between historic and real-time data?

SLIDE 53

Skew

[Diagram: watermark skew between the two sources - the historic (S3) source and the real-time Kinesis source progress at very different rates, so one source's watermark runs far ahead of the other's]

SLIDE 54

Solution: Source synchronization

[Diagram: consumers reading partitions 1-4 publish local watermarks into shared state; a global watermark is shared across consumers to coordinate their read progress]

FLINK-10887, FLINK-10921, FLIP-27
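The idea behind source synchronization can be sketched as follows (an illustration of the concept only; the real mechanism is what the FLINK issues above track): each consumer publishes a local watermark into shared state, the global watermark is the minimum of them, and any consumer that has run too far ahead of the global watermark pauses until the others catch up.

```python
class GlobalWatermark:
    """Shared state coordinating read progress across partition consumers."""

    def __init__(self, max_ahead):
        self.max_ahead = max_ahead       # how far a consumer may lead the slowest
        self.local_watermarks = {}

    def update(self, consumer_id, watermark):
        self.local_watermarks[consumer_id] = watermark

    def value(self):
        # The global watermark is held back by the slowest consumer.
        return min(self.local_watermarks.values())

    def may_read(self, consumer_id):
        # Pause consumers that have run too far ahead of the global watermark.
        return self.local_watermarks[consumer_id] <= self.value() + self.max_ahead
```

This throttles the fast (e.g. real-time) source so the slow (e.g. historic) source can catch up, bounding the skew that event-time windows have to buffer.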

SLIDE 55

Now...

SLIDE 56
  • 120+ features
  • Features available in DynamoDB (real-time point lookups), Hive (offline analysis), Druid (real-time analysis) and more…
  • Time to write, test and deploy a feature is < 1/2 day
  • p99 latency < 5 seconds
  • Coming Up - Python Support!
SLIDE 57

Thank you!

Sherin Thomas

@doodlesmt

SLIDE 58

Later

SLIDE 59

Backfill

[Diagram: historic data feeds training data; live data feeds real-time scoring data]

  • What if one implementation could provide both the training-time and scoring-time feature values?
    ○ Batch processing mode to backfill historic values for training
    ○ Stream processing mode to generate values in real time for model scoring
  • Enables delivery of consistent features between training and scoring
SLIDE 60
  • Green/blue deploys - zero-downtime deploys
  • “Auto” scaling of Flink cluster and/or job parallelism
  • Feature library