Data Engineering and Streaming Analytics
Welcome and Housekeeping
- You should have received instructions on how to participate in the training session
- If you have questions, you can use the Q&A window in GoToWebinar
- The recording of the session will be made available after the event
About Your Instructor
Doug Bateman is Director of Training and Education at Databricks. Prior to this role he was Director of Training at NewCircle.
Apache Spark - Genesis and Open Source
Spark was originally created at the AMPLab at UC Berkeley. The original creators went on to found Databricks.

Spark was created to bring data and machine learning together. It was donated to the Apache Software Foundation to create the Apache Spark open source project.
VISION: Accelerate innovation by unifying data science, engineering and business.

WHO WE ARE: Original creators of Apache Spark; 2000+ global companies use our platform across the big data & machine learning lifecycle.

SOLUTION: Unified Analytics Platform.
Apache Spark: The 1st Unified Analytics Engine
The Databricks Runtime uniquely combines Data & AI technologies:
- Spark Core Engine: big data processing (ETL + SQL + Streaming)
- Machine Learning: MLlib + SparkR
- Delta: an open format based on Parquet, with transactions, exposed through the Apache Spark APIs
Introducing Delta Lake
A New Standard for Building Data Lakes
Apache Spark - A Unified Analytics Engine
Apache Spark
"Unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing"

- Research project at UC Berkeley in 2009
- APIs: Scala, Java, Python, R, and SQL
- Built by more than 1,200 developers from more than 200 companies
HOW TO PROCESS LOTS OF DATA?

The classic classroom analogy: counting a jar of M&Ms. Split the candies among many helpers, have each count their own pile in parallel, then add up the subtotals.
Spark Cluster

A Spark cluster consists of one Driver JVM and many Executor JVMs.
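The driver/executor split can be sketched in plain Python: a "driver" partitions the data and farms partial counts out to parallel workers, then combines the results. This is an illustrative analogy only, not the Spark API; threads stand in for executor JVMs.

```python
from concurrent.futures import ThreadPoolExecutor

def count_partition(partition):
    # Executor-side work: count the records in one partition.
    return len(partition)

def distributed_count(records, num_partitions=4):
    # Driver-side work: split the data, dispatch tasks, combine results.
    partitions = [records[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = pool.map(count_partition, partitions)
    return sum(partial_counts)
```

A real Spark job follows the same shape: the driver plans tasks over partitions, executors run them in parallel, and the driver aggregates the results.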
Data Lakes - A Key Enabler of Analytics
A data lake feeds Data Science and ML use cases:
- Recommendation Engines
- Risk, Fraud, & Intrusion Detection
- Customer Analytics
- IoT & Predictive Maintenance
- Genomics & DNA Sequencing
Data Lake Challenges
Unreliable, low-quality data. Slow performance.
More than 65% of big data projects fail, per Gartner.
1. Data Reliability Challenges
- Failed production jobs leave data in a corrupt state, requiring tedious recovery
- Lack of schema enforcement creates inconsistent and low-quality data
- Lack of consistency makes it almost impossible to mix appends and reads, batch and streaming
2. Performance Challenges
- Too many small or very big files: more time is spent opening and closing files than reading their contents (worse with streaming)
- Partitioning (aka "poor man's indexing") breaks down if you picked the wrong fields, or when data has many dimensions or high-cardinality columns
- No caching: cloud storage throughput is low (S3 is 20-50 MB/s/core vs 300 MB/s/core for local SSDs)
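The small-file problem, and the compaction fix Delta applies, can be sketched in plain Python using line-oriented text files; the real OPTIMIZE command rewrites Parquet files, but the idea is the same: fewer, larger files mean far fewer open/close round trips per query.

```python
import os

def write_small_files(directory, num_files, rows_per_file):
    # Simulate a streaming job that leaves behind many tiny files.
    for i in range(num_files):
        with open(os.path.join(directory, f"part-{i:05d}.txt"), "w") as f:
            f.write("\n".join(f"row-{i}-{j}" for j in range(rows_per_file)))

def compact(directory, target_file="compacted-00000.txt"):
    # Rewrite many small files into one large file, then drop the originals.
    parts = sorted(p for p in os.listdir(directory) if p.startswith("part-"))
    rows = []
    for p in parts:
        path = os.path.join(directory, p)
        with open(path) as f:
            rows.extend(f.read().splitlines())
        os.remove(path)
    with open(os.path.join(directory, target_file), "w") as f:
        f.write("\n".join(rows))
    return len(parts), len(rows)
```

After compaction, a reader opens one file instead of hundreds, which is exactly why compaction matters most for streaming workloads that commit frequently.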
Databricks Delta
Next-generation engine built on top of Spark
- Co-designed compute & storage
- Compatible with Spark APIs
- Built on open standards (Parquet)
Under the hood, Delta combines indexes & stats, a transactional log, and versioned Parquet files, and leverages your cloud blob storage.
Delta Makes Data Reliable
A Delta table pairs a transactional log with versioned Parquet files, and accepts streaming, batch, and update/delete workloads.
- ACID Transactions
- Schema Enforcement
- Upserts
- Data Versioning
These key features keep data reliable and always ready for analytics.
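Schema enforcement can be sketched in plain Python: a write whose records do not match the declared column types is rejected as a whole, instead of silently corrupting the table. This is only an illustration of the idea; Delta performs the equivalent check against the table's schema on every write. The column names below mirror the MERGE example later in the deck.

```python
SCHEMA = {"customerId": int, "address": str}  # hypothetical table schema

class SchemaError(ValueError):
    """Raised when a batch violates the table schema."""

def enforce_schema(records, schema=SCHEMA):
    # Reject the whole batch if any record violates the schema (all-or-nothing).
    for rec in records:
        if set(rec) != set(schema):
            raise SchemaError(f"column mismatch: {sorted(rec)}")
        for col, col_type in schema.items():
            if not isinstance(rec[col], col_type):
                raise SchemaError(f"{col!r} expected {col_type.__name__}")
    return records
```

Rejecting the batch up front is what keeps downstream readers from ever seeing half-written, mistyped data.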
Delta Makes Data More Performant
Fast, highly responsive queries at scale
- Compaction
- Caching
- Data skipping
- Z-ordering
These key features come from the Delta Engine: I/O & query optimizations over the Delta table (transactional log + versioned Parquet files), exposed through open Spark APIs.
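Data skipping can be sketched in plain Python, assuming per-file min/max statistics like those Delta records in its transaction log: a query with a range predicate only opens files whose stats overlap the predicate, and skips the rest without reading them. File contents and numbers here are illustrative.

```python
def file_stats(files):
    # Per-file min/max stats for one numeric column (Delta keeps these in the log).
    return [(min(rows), max(rows), rows) for rows in files]

def query_with_skipping(stats, lo, hi):
    # Scan only files whose [min, max] range overlaps the predicate lo <= x <= hi.
    scanned = 0
    result = []
    for fmin, fmax, rows in stats:
        if fmax < lo or fmin > hi:
            continue  # skip this file entirely; it is never opened
        scanned += 1
        result.extend(x for x in rows if lo <= x <= hi)
    return result, scanned
```

Z-ordering complements this: by clustering rows so that related values land in the same files, it tightens each file's min/max ranges across several columns at once, so more files can be skipped.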
Get Started with Delta using Spark APIs

Instead of parquet...

    CREATE TABLE ... USING parquet ...

    dataframe.write.format("parquet").save("/data")

...simply say delta:

    CREATE TABLE ... USING delta ...

    dataframe.write.format("delta").save("/data")
Using Delta with your Existing Parquet Tables
Step 1: Convert Parquet to Delta Tables

    CONVERT TO DELTA parquet.`path/to/table`
      [NO STATISTICS]
      [PARTITIONED BY (col_name1 col_type1, col_name2 col_type2, ...)]

Step 2: Optimize Layout for Fast Queries

    OPTIMIZE events
    WHERE date >= current_timestamp() - INTERVAL 1 day
    ZORDER BY (eventType)
Upsert/Merge: Fine-grained Updates
    MERGE INTO customers  -- Delta table
    USING updates
    ON customers.customerId = updates.customerId
    WHEN MATCHED THEN
      UPDATE SET address = updates.address
    WHEN NOT MATCHED THEN
      INSERT (customerId, address) VALUES (updates.customerId, updates.address)
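The MERGE semantics above can be sketched in plain Python: rows whose key matches an existing row are updated in place, and unmatched rows are inserted. Column names mirror the SQL example; the dict stands in for the Delta table.

```python
def merge_into(customers, updates, key="customerId"):
    # Upsert: WHEN MATCHED -> update address, WHEN NOT MATCHED -> insert the row.
    table = {row[key]: dict(row) for row in customers}
    for upd in updates:
        if upd[key] in table:
            # WHEN MATCHED THEN UPDATE SET address = updates.address
            table[upd[key]]["address"] = upd["address"]
        else:
            # WHEN NOT MATCHED THEN INSERT (customerId, address)
            table[upd[key]] = {key: upd[key], "address": upd["address"]}
    return list(table.values())
```

In Delta the whole merge commits atomically, so readers see either the table before the upsert or after it, never a mix.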
Time Travel

    SELECT count(*) FROM events TIMESTAMP AS OF timestamp
    SELECT count(*) FROM events VERSION AS OF version

    spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/events/")

    INSERT INTO my_table
    SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)

Use cases: reproduce experiments & reports; roll back accidental bad writes.
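Time travel falls out of the versioned transaction log: each committed write appends a new version, and reading "as of" version N simply replays the log up to N. A minimal sketch, assuming append-only commits (real Delta logs also record file removals and schema changes):

```python
class VersionedTable:
    """Toy model of Delta's log: every commit is a new version; reads can target any version."""

    def __init__(self):
        self.log = []  # list of committed batches; list index = version number

    def commit(self, rows):
        # Append a new batch; its position in the log is its version.
        self.log.append(list(rows))
        return len(self.log) - 1

    def read(self, version_as_of=None):
        # Replay the log up to (and including) the requested version.
        end = len(self.log) if version_as_of is None else version_as_of + 1
        return [row for batch in self.log[:end] for row in batch]
```

Because old versions are never mutated, reproducing last week's report is just a read at an old version, and rolling back a bad write is a re-insert from an earlier one.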
Apple: Threat Detection at Scale with Delta
Scale: more than 100 TB of new data per day, more than 300 billion events per day. The pipeline streams data into Databricks Delta, refines it, and raises alerts, feeding downstream data science and machine learning.
BEFORE DELTA
- Took 20 engineers; 24 weeks to build
- Only able to analyze a 2-week window of data
WITH DELTA
- Took 2 engineers; 2 weeks to build
- Able to analyze 2 years of batch data together with streaming data
- Detect signals across user, application, and network logs
- Quickly analyze the blast radius with ad hoc queries
- Respond quickly in an automated fashion
- Scale across petabytes of data and hundreds of security analysts
Spark References
- Databricks
- Apache Spark ML Programming Guide
- Scala API Docs
- Python API Docs
- Spark Key Terms
Questions?
Further Training Options: http://bit.ly/DBTrng
- Live Onsite Training
- Live Online
- Self Paced
Meet one of our Spark experts: http://bit.ly/ContactUsDB