DSC 102 Systems for Scalable Analytics Arun Kumar Topic 7: ML - - PowerPoint PPT Presentation



SLIDE 1

Topic 7: ML Deployment (not included for the Final Exam)

Arun Kumar


DSC 102 Systems for Scalable Analytics

Slide Content ACKs: Alkis Polyzotis, Manasi Vartak

SLIDE 2


The Lifecycle of ML-based Analytics

Data acquisition → Data preparation → Feature Engineering → Training & Inference → Model Selection → Model Serving → Monitoring

SLIDE 3


Deployment Stage of Data Science

❖ Data science does not exist in a vacuum. It must interplay with the data-generating process and the prediction application
❖ Deploy Stage: Integrate the trained prediction function(s) with the production environment, e.g., offline inference in a data system, online inference on a Web platform / IoT / etc.
❖ Typically, the data scientist must work with “DevOps” or “MLOps” engineers to achieve this
SLIDE 4


https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf

ML in Academia vs Production

What your classes on statistics, ML, AI, etc. cover! ☺

SLIDE 5


Deployment Stage of Data Science

❖ Deployment stage typically involves 5 main activities in sync with other stages:

  • 1. Packaging and Orchestration
  • 2. Prediction Serving
  • 3. Data Validation
  • 4. Prediction Monitoring
  • 5. Versioning
SLIDE 6


  • 1. Packaging and Orchestration

❖ Basic Goal: Bundle up the software to deploy, along with its dependencies, into a lightweight standalone executable that can run almost seamlessly across different OSs and hardware environments
❖ Most common approach today: Containerization
❖ Not specific to ML deployment but highly general
❖ The older-generation approach, “virtual machines,” included the OS too and was bulky and slow
❖ Docker and Kubernetes are the most popular options today
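As a purely illustrative sketch of containerization, a minimal Dockerfile for a hypothetical Python prediction service might look like the following (the file names `model.pkl`, `serve.py`, and `requirements.txt` are made up for this example):

```dockerfile
# Start from a slim official Python base image
FROM python:3.10-slim
WORKDIR /app
# Install the service's pinned dependencies first (cached layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the trained prediction function and the serving code
COPY model.pkl serve.py ./
EXPOSE 8080
# Launch the prediction service when the container starts
CMD ["python", "serve.py"]
```

The resulting image bundles code, dependencies, and runtime, so the same artifact runs on a laptop, a cluster, or the cloud.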

SLIDE 7


https://medium.com/edureka/kubernetes-vs-docker-45231abeeaf1

  • 1. Packaging and Orchestration
SLIDE 8


❖ Often, one might need to deploy end-to-end pipelines with effectively independent containerized software modules
❖ Workflow orchestration tools help handle such complex pipelines
❖ Can specify time constraints, operational constraints, etc.
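As a toy illustration of what an orchestration tool does (not a real orchestrator's API), the sketch below runs a made-up pipeline of independent steps in dependency order, the way tools like Airflow or Kubeflow Pipelines do at scale; the task names are hypothetical:

```python
# Toy pipeline orchestration: execute tasks in topological order.
from graphlib import TopologicalSorter

# Each task maps to the tasks it depends on.
pipeline = {
    "acquire": [],
    "prepare": ["acquire"],
    "featurize": ["prepare"],
    "train": ["featurize"],
    "deploy": ["train"],
}

def run_pipeline(dag):
    """Execute tasks in a valid dependency order; return that order."""
    order = list(TopologicalSorter(dag).static_order())
    for task in order:
        pass  # in practice: launch the task's container and await completion
    return order

print(run_pipeline(pipeline))
```

A real orchestrator adds scheduling, retries, and monitoring on top of this core dependency-ordering idea.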

  • 1. Packaging and Orchestration
SLIDE 9


❖ Cloud providers are also starting to make it easier to package and deploy prediction software, e.g., Model Endpoint in AWS SageMaker
❖ Data scientists must look out for their organization’s tools and services

  • 1. Packaging and Orchestration
SLIDE 10


❖ Basic Goal: Make ML inference fast and potentially co-optimize with the serving environment/infrastructure
❖ Typically automated tools; so data scientists only need to know what systems are available and how to use them
❖ 3 main kinds of systems:
  • Program optimization of the prediction function to improve hardware utilization, e.g., ONNX Runtime or Apache TVM
  • Batch optimization of many concurrent prediction requests to better balance latency and throughput and improve hardware utilization, e.g., AWS SageMaker
  • New hardware optimized for inference, e.g., TPUs

  • 2. Prediction Serving
SLIDE 11


❖ Basic Goal: Ensure the data fed into the prediction function conforms to its expectations on, say, schema/syntax/shape, integrity constraints (e.g., value ranges or domains), etc.
❖ Needs to be in lock step with the data sourcing stage: acquiring, re-organizing, cleaning, and feature extraction
❖ Industry is starting to build platforms to make this process more rigorous and reusable, e.g., TensorFlow Extended
❖ Data scientists must learn their organization’s data validation practices and tools/APIs
❖ Also covered in Alkis’s guest lecture; further reading: https://mlsys.org/Conferences/2019/doc/2019/167.pdf
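The schema-and-range checks described above can be sketched as a small validator. This is an illustrative example, not a real platform's API; the field names and ranges are made up:

```python
# Hypothetical pre-prediction data validation: check each input record
# against an expected schema and value ranges before inference.
EXPECTED_SCHEMA = {"age": int, "income": float}
VALUE_RANGES = {"age": (0, 120), "income": (0.0, 1e7)}

def validate(record: dict) -> list[str]:
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    for field, ftype in EXPECTED_SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
        else:
            lo, hi = VALUE_RANGES[field]
            if not (lo <= record[field] <= hi):
                errors.append(f"out-of-range {field}: {record[field]}")
    return errors

print(validate({"age": 34, "income": 52000.0}))  # [] (record passes)
print(validate({"age": -5}))  # out-of-range age and missing income
```

Systems like TensorFlow Extended generalize this idea by inferring the expected schema from training data and flagging serving records that deviate from it.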

  • 3. Data Validation
SLIDE 12


❖ Basic Goal: Ensure the prediction functions are working as intended by the data scientist; “silent failures” can happen due to concept drift, i.e., the data distribution has deviated significantly from when the prediction function was built!
❖ Example: A sudden world event changes Web user behavior drastically, e.g., WHO declares a pandemic! ☺
❖ Needs to be in lock step with the model building stage
❖ Industry today uses ad hoc statistical approaches
❖ Data scientists must look out for their organization’s monitoring practices, since these affect the lifecycle loop frequency
❖ Also covered in Alkis’s guest lecture; further reading: https://mlsys.org/Conferences/2019/doc/2019/167.pdf
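As one example of the ad hoc statistical approaches mentioned above, the sketch below flags drift in a single feature when its serving-time mean shifts by more than a few training-time standard deviations. This is an illustrative toy, not a production monitor; real systems use more rigorous tests (e.g., Kolmogorov–Smirnov tests or population stability index):

```python
# Toy drift check: compare a feature's serving statistics against training.
from statistics import mean, stdev

def drifted(train_vals, serve_vals, threshold=3.0):
    """Flag drift if the serving mean is > threshold train-stdevs away."""
    mu, sigma = mean(train_vals), stdev(train_vals)
    return abs(mean(serve_vals) - mu) > threshold * sigma

train = [10.0, 11.0, 9.5, 10.5, 10.0]
print(drifted(train, [10.2, 9.8, 10.4]))   # False: similar distribution
print(drifted(train, [25.0, 26.0, 24.5]))  # True: large shift
```

When a monitor like this fires, the lifecycle loops back to the model building stage for retraining on fresher data.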

  • 4. Prediction Monitoring
SLIDE 13


❖ Basic Goal: Just like regular code, prediction software must be versioned and tracked for teams to ensure consistency across time and employees, as well as for auditing’s sake, the ability to “roll back” to a safer state, etc.
❖ But unlike regular code, prediction software has 3 more dependencies beyond just code: datasets (train/val/test), configuration (e.g., hyper-parameters), and environment (hardware/software, since that can affect accuracy too)
❖ Research and industry are only just starting to figure this out
❖ Data scientists must look out for versioning best practices/tools
❖ Covered in Manasi’s guest lecture; further reading: https://blog.verta.ai/blog/how-to-move-fast-in-ai-without-breaking-things
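One simple way to capture all the dependencies listed above is to hash them together into a single version identifier, so that a change to any one of code, data, configuration, or environment yields a new version. This is a sketch under that assumption, not an established tool's API; all identifiers are hypothetical:

```python
# Sketch: derive a reproducible version id for a model from its
# four dependencies: code, dataset, configuration, and environment.
import hashlib
import json

def model_version_id(code_commit, dataset_hash, config, environment):
    """Hash the four dependencies into one stable version identifier."""
    payload = json.dumps({
        "code": code_commit,        # e.g., a git commit hash
        "data": dataset_hash,       # e.g., a hash of train/val/test splits
        "config": config,           # e.g., hyper-parameters
        "env": environment,         # e.g., library/hardware versions
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

v1 = model_version_id("abc123", "d4e5f6", {"lr": 0.01}, {"sklearn": "1.3"})
v2 = model_version_id("abc123", "d4e5f6", {"lr": 0.02}, {"sklearn": "1.3"})
print(v1 != v2)  # True: changing any dependency changes the version id
```

Tools discussed in the guest lecture (e.g., Verta's ModelDB lineage tracking) build on this idea by storing the dependencies themselves, not just a hash, so older versions can be rolled back to and audited.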

  • 5. Versioning