Mastering Data with Spark and ML, Strata London 2019


slide-1
SLIDE 1

Mastering Data with Spark and ML

Strata London 2019

slide-2
SLIDE 2

About Me

  • IIT Delhi, 1998
  • Founder and CEO, Nube Technologies
  • Strata Data San Jose Program Committee
  • Speaker at Spark Summit, Strata, GIDS, etc.

slide-3
SLIDE 3

Nube

  • India-based startup
  • Deep technical problems with an enterprise solution
  • ML, Big Data, UX

slide-4
SLIDE 4

This talk today

  • Problem Statement
  • Our Approach

slide-5
SLIDE 5

Simple business asks

  • Customer LTV
  • Best supplier for a part
  • Supplier payment terms
  • Householding
  • Cross-sell opportunities
  • M&A

slide-6
SLIDE 6

Actual Data

slide-7
SLIDE 7
slide-8
SLIDE 8

Actual data

  • Silos
  • Data Quality
  • Volumes

slide-9
SLIDE 9

Challenges

  • Variety of sources
  • Scale
  • Capturing rules for matching and merging
  • Working across different business entities

slide-10
SLIDE 10

Wishlist

  • Any source and format
  • Any entity type
  • Any volume

slide-11
SLIDE 11

Reifier

AI powered data management, matching and merging different data sources to build a holistic view.

  • MDM
  • Fraud and Analytics
  • Sales and Marketing
  • Customer AML/KYC, cross-sell and upsell
  • Data Enrichment
  • Reference data Management
  • Data Quality
slide-12
SLIDE 12

Our stack

slide-13
SLIDE 13

Wishlist

  • Any source and format
  • Any entity type
  • Any volume

slide-14
SLIDE 14

Any source and format

  • Based on RDDs
  • Custom source and sink formats, written by us or borrowed from the community

slide-15
SLIDE 15

Any source/sink, Any format

Elastic: Cassandra:

slide-16
SLIDE 16

Problems with RDDs

  • Record-wise reading was good, but adding structure to the data was left to us
  • reifier.Tuple: an indexed data structure
  • Development and maintenance nightmare
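To illustrate why an index-based record becomes hard to maintain, here is a minimal Python sketch. It is not the reifier.Tuple API (which the slides do not show); the field layout and the `Customer` type are hypothetical and only contrast positional access with named access.

```python
from typing import NamedTuple

# Positional style: every consumer has to remember that index 0 is the id,
# index 1 the name and index 2 the city (a hypothetical layout).
raw_record = ("42", "Acme Corp", "London")
city_by_index = raw_record[2]


# Named style: the schema travels with the data, which is the kind of
# structure a typed Dataset provides out of the box.
class Customer(NamedTuple):
    record_id: str
    name: str
    city: str


customer = Customer(*raw_record)
city_by_name = customer.city
```

If a new column is inserted ahead of the city, every `raw_record[2]` in the codebase silently reads the wrong value, while `customer.city` keeps working.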

slide-17
SLIDE 17

Reifier 2.0

  • Datasets
  • Pipe abstraction
slide-18
SLIDE 18

Building Dataset through Pipe


slide-19
SLIDE 19

Spark Integration

  • Tried Livy etc., but it meant an additional dependency
  • Finally, we integrate in two ways: a local SparkContext, or through the SparkLauncher

slide-20
SLIDE 20
slide-21
SLIDE 21
slide-22
SLIDE 22

Wishlist

  • Any source and format
  • Any entity type
  • Any volume

slide-23
SLIDE 23

Any entity type

  • Traditional rule based system fails
  • AI to the rescue
  • Also Cassandra
slide-24
SLIDE 24

Reifier Interactive Learner

slide-25
SLIDE 25

Reifier Interactive Learner

slide-26
SLIDE 26

Any scale

  • Add Spark to the mix
  • Ouch, cartesian join: 1 million records = on the order of a trillion comparisons
  • Learn what to join
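The arithmetic behind the cartesian join, and the idea of only joining records that share a key, can be sketched in plain Python. This is a single-process illustration, not Reifier's implementation: the talk says the join key is learned, whereas `first_three_letters` below is a hand-written, hypothetical stand-in, and the sample records are made up.

```python
from collections import defaultdict
from itertools import combinations


def cartesian_comparisons(n: int) -> int:
    """Comparisons when every record is joined with every record."""
    return n * n


def unique_pairs(n: int) -> int:
    """Distinct unordered pairs among n records."""
    return n * (n - 1) // 2


def blocked_pairs(records, block_key):
    """Compare only records that fall into the same block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


def first_three_letters(record):
    # Hypothetical hand-written key standing in for a learned one.
    return record["name"][:3].lower()


records = [
    {"id": 1, "name": "Acme Corp"},
    {"id": 2, "name": "ACME Corporation"},
    {"id": 3, "name": "Bolt Ltd"},
    {"id": 4, "name": "Bolt Limited"},
    {"id": 5, "name": "Zen Inc"},
]

candidate_pairs = blocked_pairs(records, first_three_letters)
```

With these five records, blocking leaves 2 candidate pairs instead of the 10 distinct pairs a full comparison would need, and `cartesian_comparisons(1_000_000)` is the trillion mentioned on the slide.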

slide-27
SLIDE 27

AutoML

  • Build multiple models based on the training data
  • Optimize for accuracy and performance
  • Use Spark to train and assess different models
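As a rough illustration of "build several candidates, score them, keep the best", here is a toy Python sketch. Everything in it is an assumption made for the example: the threshold models, the similarity features, the labelled pairs, and the use of feature count as a stand-in for runtime cost. The talk trains and assesses models with Spark; this sketch runs in a single process and, for brevity, scores on the same pairs it selects with.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Pair = Tuple[Dict[str, float], bool]  # (similarity features, is_match label)


@dataclass(frozen=True)
class ThresholdModel:
    name: str
    features: Tuple[str, ...]
    threshold: float

    def predict(self, similarities: Dict[str, float]) -> bool:
        # Average the chosen similarity features and compare to the threshold.
        score = sum(similarities[f] for f in self.features) / len(self.features)
        return score >= self.threshold


def accuracy(model: ThresholdModel, pairs: List[Pair]) -> float:
    correct = sum(model.predict(sims) == label for sims, label in pairs)
    return correct / len(pairs)


def select_best(models: List[ThresholdModel], pairs: List[Pair]) -> ThresholdModel:
    # Prefer higher accuracy; break ties in favour of fewer features (cheaper).
    return max(models, key=lambda m: (accuracy(m, pairs), -len(m.features)))


LABELED_PAIRS: List[Pair] = [
    ({"name_sim": 0.95, "city_sim": 1.0}, True),
    ({"name_sim": 0.90, "city_sim": 0.0}, True),
    ({"name_sim": 0.40, "city_sim": 1.0}, False),
    ({"name_sim": 0.20, "city_sim": 0.0}, False),
    ({"name_sim": 0.85, "city_sim": 1.0}, True),
    ({"name_sim": 0.30, "city_sim": 1.0}, False),
]

CANDIDATES = [
    ThresholdModel("name_only", ("name_sim",), 0.8),
    ThresholdModel("city_only", ("city_sim",), 0.5),
    ThresholdModel("name_and_city_loose", ("name_sim", "city_sim"), 0.6),
    ThresholdModel("name_and_city_strict", ("name_sim", "city_sim"), 0.8),
]

best = select_best(CANDIDATES, LABELED_PAIRS)
```

Because each candidate is scored independently of the others, the evaluation loop is the part that lends itself to being distributed.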

slide-28
SLIDE 28

Cassandra

  • Any Entity
  • Any Scale

slide-29
SLIDE 29

Cassandra Training

  • Primary Key: Cluster Id, Record Id
  • Secondary Index: r_isMatch

slide-30
SLIDE 30

Cassandra Entity

  • Primary Key: Record Id
  • Secondary Index: Cluster Id
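The key layouts on the Cassandra Training and Cassandra Entity slides could be written as CQL along the following lines, held here as Python strings. This is a sketch under assumptions: the table names, index names, and column types are hypothetical, and reading "Cluster Id, Record Id" as partition key plus clustering column is one plausible interpretation of the slide.

```python
# Training table: keyed by (cluster_id, record_id), indexed on r_isMatch.
# Unquoted CQL identifiers are case-insensitive, so r_isMatch is stored
# as r_ismatch. Column types are assumptions.
TRAINING_DDL = [
    """
    CREATE TABLE IF NOT EXISTS training (
        cluster_id text,
        record_id text,
        r_isMatch int,
        PRIMARY KEY (cluster_id, record_id)
    );
    """,
    "CREATE INDEX IF NOT EXISTS training_is_match_idx ON training (r_isMatch);",
]

# Entity table: keyed by record_id, with a secondary index on cluster_id
# so that all records of one resolved entity can be looked up together.
ENTITY_DDL = [
    """
    CREATE TABLE IF NOT EXISTS entity (
        record_id text,
        cluster_id text,
        PRIMARY KEY (record_id)
    );
    """,
    "CREATE INDEX IF NOT EXISTS entity_cluster_idx ON entity (cluster_id);",
]

ALL_STATEMENTS = TRAINING_DDL + ENTITY_DDL
```

Each statement can be passed, one at a time, to whichever Cassandra client is in use.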

slide-31
SLIDE 31

Elastic

  • Free-flowing search
  • Ad hoc analytics
  • Realtime Plugin

slide-32
SLIDE 32

Thank You!

www.nubetech.co
sonal@nubetech.co