TRAINING DEEP LEARNING MODELS AT SCALE USING KUBERNETES Mitul - - PowerPoint PPT Presentation




slide-1
SLIDE 1

Mitul Tiwari and Deepak Bobbarjung

TRAINING DEEP LEARNING MODELS AT SCALE USING KUBERNETES

slide-2
SLIDE 2

Introductions

slide-3
SLIDE 3

Outline

  • Conversational AI and Deep Learning
  • Need for a Jobs framework on Kubernetes
  • Our Jobs architecture

slide-4
SLIDE 4

Our Conversational AI Platform

  • Bot Builder: No Coding Required.
  • Bot Training: #1 AI/NLP Model.
  • Bot Deployment: Build Once, Deploy Everywhere.
  • AI/NLP: Intents, Entities, Attributes, Speech to text, Language Translation, Sentiment, Knowledge Base

slide-5
SLIDE 5

How to Make a Bot Intelligent?

  • Natural Language Understanding
  • Information Extraction
  • Entities
  • Intents
  • Actions
  • Natural Language Generation
  • Generating Response
slide-6
SLIDE 6

Deep Learning

  • Traditional Machine Learning
  • Human-designed features and representations
  • Optimize weights to combine
  • Deep Learning
  • Deep Neural Network
  • Learn good features and multiple levels of representations

slide-7
SLIDE 7

Deep Learning for NLP

  • Language Translation
  • Image Captioning
  • Text Summarization
  • Parts-of-speech Tagging
  • Named Entity Recognition
  • Natural Language Generation
  • Question-Answering
  • Optical Character Recognition
  • Speech Recognition
  • Machine Reading Comprehension
slide-8
SLIDE 8

Neural Network for Word Embedding

  • Word Embedding: Word2Vec
  • Embed words in a continuous vector space
  • Semantically similar words are mapped to nearby points

  • Enables powerful operations
  • “King” - “Man” + “Woman” -> “Queen”
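
The vector arithmetic above can be illustrated with a toy sketch. The 3-dimensional vectors below are hand-made for illustration (they are not learned Word2Vec embeddings), and the `analogy` helper is a hypothetical name; the point is only that the nearest neighbour of vec(king) - vec(man) + vec(woman) is found by cosine similarity.

```python
import math

# Hand-made 3-dimensional toy vectors (NOT learned Word2Vec embeddings).
# The axes loosely stand for: royalty, maleness, femaleness.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))


result = analogy("king", "man", "woman")
print(result)
```

With real embeddings the same lookup runs over the whole vocabulary, which is why the input words are usually excluded from the candidates.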
slide-9
SLIDE 9

Bag of Words - Curse of Dimensionality

  • Before word embeddings - Bag of words
  • Dictionary of words & counts in the text
  • Easy feature generation technique
  • Limitations
  • Hard to capture order of words
  • Curse of dimensionality - limited vocabulary - similar words don’t match
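
A minimal sketch of the bag-of-words limitations listed above, using only the Python standard library. The example sentences and the `bag_of_words` helper are illustrative assumptions, not taken from the talk.

```python
from collections import Counter


def bag_of_words(text):
    """Map a text to a dictionary of lowercase word counts."""
    return Counter(text.lower().split())


# Word order is lost: two sentences with opposite meaning get the same bag.
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")
same_bag = a == b

# Similar words do not match: synonyms share no dictionary entries.
c = bag_of_words("great movie")
d = bag_of_words("excellent film")
overlap = sum((c & d).values())

print(a, same_bag, overlap)
```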
slide-10
SLIDE 10

Word Embeddings Cont’d

  • Word Embedding: mapping words to a high-dimensional vector space, typically 200-500 dimensions, e.g.,
  • W(‘King’) = (0.2, -0.4, 0.9, …)
  • W(‘Queen’) = (0.1, -0.3, 0.8, …)
  • Learn representations of words
  • How: two-layer NN to learn word representations by predicting validity of phrases

Example of similar word vectors
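
One common way to measure how "similar" two word vectors are is cosine similarity. The sketch below uses only the first three components shown on the slide for ‘King’ and ‘Queen’ (the remaining components are elided in the source and are not filled in here), so the number it prints is illustrative only.

```python
import math

# Only the three components shown on the slide; the rest are elided ("…").
w_king = (0.2, -0.4, 0.9)
w_queen = (0.1, -0.3, 0.8)


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


similarity = cosine_similarity(w_king, w_queen)
print(round(similarity, 3))
```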

slide-11
SLIDE 11

Sequence Learning: Response Generation

  • Automated Response Generation
  • Sequence 2 Sequence Model
  • Recurrent Neural Network (RNN)
  • Long Short Term Memory Network (LSTM)
  • Example: GMail Smart Reply
  • Automated Response Suggestions
slide-12
SLIDE 12

Sequence Learning: RNN And LSTM

  • Recurrent Neural Network
  • Output of a module goes into a module of the same type (recurrent)
  • Good for capturing a sequence
  • Long Short Term Memory Network
  • Long running cell state: forget & add new values
  • Output: combination of cell state, previous output, and new input
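
The LSTM bullets above can be sketched as a single-unit (scalar) cell step in plain Python. This is a textbook-style simplification with made-up weights, not the model used in the talk; the gate names and the `lstm_step` signature are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps a gate name to (weight_on_input, weight_on_prev_output, bias).
    """
    def gate(name, squash):
        w_x, w_h, bias = params[name]
        return squash(w_x * x + w_h * h_prev + bias)

    forget = gate("forget", sigmoid)        # how much old cell state to keep
    write = gate("input", sigmoid)          # how much new value to add
    candidate = gate("candidate", math.tanh)
    out = gate("output", sigmoid)

    c = forget * c_prev + write * candidate  # long-running cell state
    h = out * math.tanh(c)                   # output mixes state, prev output, input
    return h, c


# Made-up weights, identical for every gate, just to run the recurrence.
PARAMS = {name: (0.5, 0.5, 0.0)
          for name in ("forget", "input", "candidate", "output")}

h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, PARAMS)
print(h, c)
```

Feeding each step's `h` and `c` back into the next call is the "recurrent" part; the forget and input gates are what let the cell state carry information across many steps.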

slide-13
SLIDE 13

Training Deep Learning Models for NLP

  • Intent Classification
  • Deep Learning — LSTM
  • Information Extraction
  • Named Entity Recognition (NER)
  • Slot attributes
  • Sentiment and Complaint Classifier
  • Knowledge Base & Semantic Search
  • Machine Reading Comprehension

[Figure: platform capabilities grouped under Text, Sentiment, Notifications, and Speech: Natural Language Understanding & Generation; Analysis for Complaints; Targeted Personalized Timely Notification; Automatic Speech Recognition & Generation; Deep Learning Entity Graph & Knowledge Base.]

slide-14
SLIDE 14

Scaling Training Deep Learning Models For NLP

  • Offline: Started with a script for training models
  • Run Time: A service for prediction during runtime
  • However, the number of models is reaching the thousands
  • Hard to manage a model training script for each bot

[Diagram: Control Plane (Offline): Training Data, Train Models, Store Models. Conversation Plane (Run Time): Users, Interfaces, Load Models, NLP Prediction (for example, Intent classification …).]

slide-15
SLIDE 15

Outline

  • Conversational AI and Deep Learning
  • Need for a Jobs framework on Kubernetes
  • Our Jobs architecture

slide-16
SLIDE 16

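Passage AI Architecture

[Diagram: three planes. Control plane: API Service, Configure Bots, Train Bot, Config Data. Conversation plane: Orchestration, AI/NLP, User Data. Training plane: training job producing a model. Example flow: the user says "Hi", the message is sent to Classify (Hi), the model returns the Greetings intent, and the bot replies "Welcome to..".]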

slide-17
SLIDE 17

Passage AI Architecture

Training plane (Jobs)

slide-18
SLIDE 18

When do we train a new model for a bot?

  • When a new bot is created
  • When a bot is changed
  • Utterances are added or modified
  • New training data is available
slide-19
SLIDE 19

[Chart: # of Bots by month, August to November; y-axis ticks 125, 250, 375, 500.]

[Chart: # of Bot changes per day by month, August to November; y-axis ticks 25, 50, 75, 100.]

slide-20
SLIDE 20

Why do we need a Jobs framework?

  • Run jobs at scale
  • Eliminate out-of-band scripts that tend to become ‘tribal’.
  • APIs and UI for exposing jobs to our customers in our Bot Builder UI.

  • Reporting and auditing around jobs.

[Diagram: Control plane API Service exposing three operations: Create job specification, Run a job, Show last 10 job runs.]
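
The three operations named on this slide (create a job specification, run a job, show the last 10 job runs) could be sketched as an in-memory class like the one below. All class, method, and field names here are hypothetical; the talk does not show the actual API.

```python
import itertools


class JobsApi:
    """In-memory stand-in for a jobs API (all names are hypothetical)."""

    def __init__(self):
        self._specs = {}
        self._runs = []
        self._ids = itertools.count(1)

    def create_job_spec(self, job_type, params):
        """Store a reusable job specification and return its id."""
        spec_id = next(self._ids)
        self._specs[spec_id] = {"job_type": job_type, "params": params}
        return spec_id

    def run_job(self, spec_id):
        """Create a new job run from an existing specification."""
        if spec_id not in self._specs:
            raise KeyError(f"unknown job spec {spec_id}")
        run = {"job_id": next(self._ids), "job_spec_id": spec_id,
               "state": "queued"}
        self._runs.append(run)
        return run

    def last_job_runs(self, limit=10):
        """Return up to `limit` most recent runs, newest first."""
        return list(reversed(self._runs[-limit:]))


api = JobsApi()
spec = api.create_job_spec("training", {"bot_id": "bot-1"})
for _ in range(12):
    api.run_job(spec)
recent = api.last_job_runs()
print(len(recent), recent[0]["job_id"])
```

Keeping job specs separate from job runs is what makes reporting and auditing straightforward: every run points back to the spec it came from.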

slide-21
SLIDE 21

Why Kubernetes (K8S) For Our Microservices?

  • Scale and availability of our microservices

[Diagram: Conversation plane with Nginx and the Orchestration Service; the service is backed by a K8s service, a K8s hpa, and a K8s deployment of multiple pods.]
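
The "K8s hpa" box refers to the Kubernetes Horizontal Pod Autoscaler. Its documented core rule is desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below applies that rule with a min/max clamp; it deliberately ignores the autoscaler's tolerance band and stabilization behaviour, and the default bounds are arbitrary assumptions.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Simplified HPA scaling rule (no tolerance or stabilization window)."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 5 pods averaging 90% CPU against a 60% target: scale up.
scale_up = desired_replicas(5, 90, 60)
# 5 pods averaging 30% CPU against a 60% target: scale down.
scale_down = desired_replicas(5, 30, 60)
print(scale_up, scale_down)
```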

slide-22
SLIDE 22

Why Kubernetes? (Cont’d)

  • Cloud Agnostic and On-prem ready

$ helm install passage-ai

[Diagram: the same Conversation plane and Control plane pair deployed to each environment: Integration, Staging, Production, Standby, and On-Prem.]

slide-23
SLIDE 23

Why Create The Jobs Framework In Kubernetes?

  • Handle scale and availability in the same way as our microservices
  • Jobs should also be cloud-agnostic and on-prem ready
  • Same set of tools for monitoring, logging and auditing.

[Diagram: Jobs Plane made up of many pods.]

slide-24
SLIDE 24

Outline

  • Conversational AI and Deep Learning
  • Need for a Jobs framework on Kubernetes
  • Our Jobs architecture

slide-25
SLIDE 25

Example Job types in our system

  • Training deep learning models
  • Extracting and indexing knowledge base articles

  • Nightly testing of our bots
slide-26
SLIDE 26

Job Specification (Control plane API: create a job spec)

  • Schedule: “Every Monday at 1 AM”
  • Job type: “Training”
  • Training specific params: bot ID, training data, priority

Job Object (Control plane API: trigger a job from a job spec)

  • job_spec_id: <id>
  • state: in_progress
  • progress: 25
  • description: performing training
  • data: <confusion matrix>
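
The job specification and job object on this slide could be modelled roughly as follows. The field names mirror the slide where it gives them (`job_spec_id`, `state`, `progress`, `description`, `data`); everything else, including the `queued` and `done` states, the `params` keys, and the `trigger` helper, is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Optional


class JobState(Enum):
    QUEUED = "queued"            # assumed state, not shown on the slide
    IN_PROGRESS = "in_progress"  # shown on the slide
    DONE = "done"                # assumed state, not shown on the slide


@dataclass
class JobSpec:
    id: str
    schedule: str
    job_type: str
    params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Job:
    job_spec_id: str
    state: JobState = JobState.QUEUED
    progress: int = 0
    description: str = ""
    data: Optional[Dict[str, Any]] = None  # e.g. a confusion matrix

    def update(self, progress, description):
        """Record progress (0-100) and derive the state from it."""
        if not 0 <= progress <= 100:
            raise ValueError("progress must be between 0 and 100")
        self.progress = progress
        self.description = description
        self.state = JobState.DONE if progress == 100 else JobState.IN_PROGRESS


def trigger(spec):
    """Create a fresh job object from a job spec."""
    return Job(job_spec_id=spec.id)


spec = JobSpec(id="spec-1", schedule="Every Monday at 1 AM",
               job_type="Training",
               params={"bot_id": "bot-1", "priority": 1})
job = trigger(spec)
job.update(25, "performing training")
print(job)
```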

slide-27
SLIDE 27

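Jobs Architecture

[Diagram: The Control plane API Service talks to the Jobs Service in the Jobs plane, which is backed by the jobs and job_specs stores. Flow: Trigger training job from job spec -> Create a Job (params) -> Add item on Q -> Pickup item -> Update job progress. Queues: Training Q, KB Index Q, Bot Testing Q. The Training Q is served by a K8s deployment of training pods (Pod1 to Pod4) with GPUs; the other queues are served by their own pods (Pod1, Pod2).]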
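
The queue-based flow in this diagram (add an item to a queue, a pod picks it up, the pod updates job progress) can be sketched in a single process with Python threads standing in for training pods. This is an analogy under stated assumptions: the real system uses separate pods and some queueing service that the slides do not name, and the fixed progress steps stand in for actual training.

```python
import queue
import threading


def training_worker(q, jobs, lock):
    """Stand-in for a training pod: pick up items and report progress."""
    while True:
        job_id = q.get()
        if job_id is None:  # sentinel: no more work for this worker
            q.task_done()
            break
        for progress in (25, 50, 75, 100):  # placeholder for real training
            with lock:
                jobs[job_id]["progress"] = progress
                jobs[job_id]["state"] = (
                    "done" if progress == 100 else "in_progress")
        q.task_done()


def run_training_jobs(job_ids, num_workers=4):
    jobs = {j: {"state": "queued", "progress": 0} for j in job_ids}
    lock = threading.Lock()
    q = queue.Queue()
    workers = [threading.Thread(target=training_worker, args=(q, jobs, lock))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    for j in job_ids:      # "Add item on Q"
        q.put(j)
    for _ in workers:      # one sentinel per worker
        q.put(None)
    q.join()
    for w in workers:
        w.join()
    return jobs


final = run_training_jobs([f"job-{n}" for n in range(10)])
print(final["job-0"])
```

Because workers only pull from the queue, adding capacity means adding workers (pods), with no change to the code that enqueues jobs.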

slide-28
SLIDE 28

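Jobs Architecture (scheduled jobs)

[Diagram: Same layout as the previous slide, with a Scheduler added to the Jobs plane. A K8s CronJob drives the Scheduler, which triggers a job from a job spec through the Jobs Service (backed by the jobs and job_specs stores). The Control plane API Service uses Get job status. Queues: Training Q, KB Index Q, Bot Testing Q. The Training Q is served by a K8s deployment of training pods (Pod1 to Pod4), which update job progress; the other queues are served by their own pods (Pod1, Pod2).]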
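
For a schedule such as “Every Monday at 1 AM” (the example from the job specification slide), a scheduler needs to compute the next run time. In cron syntax, which Kubernetes CronJobs use, that schedule is written `0 1 * * 1`. The helper below is a hypothetical sketch of the same calculation using naive datetimes; time zones are ignored.

```python
from datetime import datetime, timedelta


def next_weekly_run(now, weekday=0, hour=1):
    """Next time strictly after `now` on `weekday` (Monday=0) at `hour`:00."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


now = datetime(2018, 11, 14, 9, 30)  # a Wednesday
next_run = next_weekly_run(now)
print(next_run)
```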

slide-29
SLIDE 29

Alternatives that we considered

  • Apache Airflow
  • Azkaban

slide-30
SLIDE 30

Thank You

Mitul Tiwari Deepak Bobbarjung

@Bobbarjung @Mitultiwari mitul@passage.ai deepak@passage.ai