SLIDE 1

Introduction to Big-data Management

Review and next steps

SLIDE 2

What We Covered

Storage (HDFS)
Query processing (MapReduce, RDD, Hyracks)
Higher-level data flow engines (Pig, SparkSQL)
Storage formats (row, column, Parquet, LSM indexing)
Document databases (MongoDB)
Machine learning (MLlib)

SLIDE 3

HDFS

[Figure: a file divided into HDFS blocks of 128 MB each (128 MB, 128 MB, 128 MB, 128 MB, …), with a name node and data nodes]
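
To make the 128 MB block size concrete, here is a minimal sketch of how a file could be cut into blocks. The function name and the 500 MB example file are illustrative assumptions, and sizes are whole megabytes for simplicity; this is only arithmetic, not the HDFS client API.

```python
BLOCK_SIZE_MB = 128  # block size shown on the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size (in MB) of each block needed to hold the file.

    Hypothetical helper for illustration: every block is full except
    possibly the last one, which holds only the remaining data.
    """
    if file_size_mb < 0:
        raise ValueError("file size must be non-negative")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


# Example: an assumed 500 MB file needs three full blocks plus a partial one.
blocks = split_into_blocks(500)
print(len(blocks), blocks)
```

The name node would keep the list of blocks for each file, while the data nodes would store the block contents.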

SLIDE 4

Logical View of MapReduce

During MapReduce, the input and output are each considered a set of key-value pairs ⟨k, v⟩

Input: ⟨k1, v1⟩, ⟨k1, v1⟩, …, ⟨k1, v1⟩
→ Map →
Intermediate data: ⟨k2, v2⟩, ⟨k2, v2⟩, …, ⟨k2, v2⟩
→ Reduce →
Output: ⟨k3, v3⟩, ⟨k3, v3⟩, …, ⟨k3, v3⟩

SLIDE 5

Map and Reduce Functions

Map Function

Maps a single input record to a set (possibly empty) of intermediate records
Map: ⟨k1, v1⟩ → {⟨k2, v2⟩}

Combine Function

Combine: ⟨k2, {v2}⟩ → {⟨k2, v2⟩}

Reduce Function

Reduces a set of intermediate records with the same key to a set (possibly empty) of output records
Reduce: ⟨k2, {v2}⟩ → {⟨k3, v3⟩}
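
The three functions can be sketched with a word count in plain Python. This is a single-process simulation under stated assumptions, not the Hadoop API: the names map_fn, combine_fn, reduce_fn, group_by_key, and run_job are hypothetical, and each inner list in the input stands in for one input split handled by one mapper.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: one input record (line number, line) to zero or more (word, 1) pairs."""
    return [(word.lower(), 1) for word in value.split()]


def combine_fn(key, values):
    """Combine: pre-aggregate the counts produced by a single mapper."""
    return [(key, sum(values))]


def reduce_fn(key, values):
    """Reduce: all intermediate values for one key to the final (word, count)."""
    return [(key, sum(values))]


def group_by_key(pairs):
    """Group values by key, which is what the shuffle does across machines."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def run_job(splits):
    """Run map and combine per split, then shuffle and reduce over all splits."""
    intermediate = []
    for split in splits:
        mapped = [pair for key, value in split for pair in map_fn(key, value)]
        for key, values in group_by_key(mapped).items():
            intermediate.extend(combine_fn(key, values))
    output = {}
    for key, values in sorted(group_by_key(intermediate).items()):
        for out_key, out_value in reduce_fn(key, values):
            output[out_key] = out_value
    return output


# Two assumed input splits, each with one (line number, text) record.
print(run_job([[(0, "big data big")], [(1, "data management")]]))
```

In this sketch the combiner only shrinks the data sent to the shuffle; removing it would not change the final counts, because summing is associative and commutative.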

SLIDE 6

Job Execution Overview

[Figure labels, in order: driver; job submission; job preparation; map and combine; shuffle; reduce; cleanup]

SLIDE 7

Resilient Distributed Dataset (RDD)

RDD is a pointer to a distributed dataset
Stores information about how to compute the data rather than where the data is
Transformation: Converts an RDD to another RDD
Action: Returns the result of an operation over an RDD
Narrow vs. wide dependencies
How RDD operations work
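
The idea that an RDD stores how to compute the data can be sketched with a toy class. ToyRDD and its methods are hypothetical names that only mimic the shape of the Spark API on a single machine: transformations return a new object that remembers its parent, and nothing is computed until an action is called.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it records a recipe, not the data itself."""

    def __init__(self, compute, parent=None, op="source"):
        self._compute = compute  # zero-argument function that yields the records
        self.parent = parent     # the RDD this one was derived from
        self.op = op             # name of the operation that created it

    @classmethod
    def from_list(cls, items):
        return cls(lambda: iter(items))

    # Transformations: build a new ToyRDD, do no work yet.
    def map(self, func):
        return ToyRDD(lambda: (func(x) for x in self._compute()), self, "map")

    def filter(self, predicate):
        return ToyRDD(lambda: (x for x in self._compute() if predicate(x)), self, "filter")

    # Actions: run the recipe and return an answer.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def lineage(self):
        """List the operations from the source down to this RDD."""
        ops, node = [], self
        while node is not None:
            ops.append(node.op)
            node = node.parent
        return list(reversed(ops))


calls = []  # records when the mapped function actually runs


def traced_square(x):
    calls.append(x)
    return x * x


evens = ToyRDD.from_list([1, 2, 3, 4]).map(traced_square).filter(lambda x: x % 2 == 0)
print(calls)            # still empty: only transformations so far
print(evens.collect())  # the action triggers the computation
print(evens.lineage())
```

Both map and filter here process each record independently, which corresponds to narrow dependencies; an operation that must regroup records by key would be a wide dependency and is left out of this sketch.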

SLIDE 8

SparkSQL

DataFrame (SparkSQL)

Lazy execution
Spark is aware of the data model
Spark is aware of the query logic
Can optimize the query

RDD

Lazy execution
The data model is hidden from Spark
The transformations and actions are black boxes
Cannot optimize the query

SLIDE 9

Storage formats

Difference between row and column formats

How attributes map to disk
Major applications for each of them
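
How attributes map to disk can be sketched by flattening the same records in two orders. The records and function names below are made-up illustrations, and a Python list stands in for the sequence of values on disk.

```python
def row_layout(records, columns):
    """Row format: all attributes of one record are stored next to each other."""
    return [record[column] for record in records for column in columns]


def column_layout(records, columns):
    """Column format: all values of one attribute are stored next to each other."""
    return [record[column] for column in columns for record in records]


# Assumed sample data, not taken from the slides.
columns = ["id", "name", "age"]
records = [
    {"id": 1, "name": "a", "age": 30},
    {"id": 2, "name": "b", "age": 25},
]

print(row_layout(records, columns))
print(column_layout(records, columns))
```

In the row layout a whole record can be read from one contiguous run of values, which tends to suit workloads that fetch or update individual records. In the column layout a single attribute is contiguous, which tends to suit analytical queries that scan a few attributes over many records.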

Parquet files

A column store file format
Handles nesting and repetition
Schema → Maximum definition and repetition level
Record → Definition and repetition level for each attribute
Do not forget to add null (non-existent) attributes
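
The maximum levels can be derived from the schema alone. The sketch below assumes the usual rule: the maximum definition level of a leaf counts the optional and repeated fields on its path, and the maximum repetition level counts only the repeated ones. The schema and the tuple encoding (name, kind, children) are illustrative and not taken from the slides.

```python
def max_levels(fields, path=(), definition=0, repetition=0):
    """Return {leaf path: (max definition level, max repetition level)}.

    Each field is a tuple (name, kind, children) where kind is one of
    "required", "optional", or "repeated", and children is a list of
    nested fields (empty for a leaf attribute).
    """
    levels = {}
    for name, kind, children in fields:
        # Optional and repeated fields may be absent, so each adds a definition level.
        field_definition = definition + (1 if kind in ("optional", "repeated") else 0)
        # Only repeated fields add a repetition level.
        field_repetition = repetition + (1 if kind == "repeated" else 0)
        full_path = path + (name,)
        if children:
            levels.update(max_levels(children, full_path, field_definition, field_repetition))
        else:
            levels[".".join(full_path)] = (field_definition, field_repetition)
    return levels


# Illustrative nested schema (an assumption for this example).
schema = [
    ("DocId", "required", []),
    ("Links", "optional", [
        ("Backward", "repeated", []),
        ("Forward", "repeated", []),
    ]),
    ("Name", "repeated", [
        ("Language", "repeated", [
            ("Code", "required", []),
            ("Country", "optional", []),
        ]),
        ("Url", "optional", []),
    ]),
]

for leaf, (d, r) in max_levels(schema).items():
    print(leaf, "max definition:", d, "max repetition:", r)
```

When a record is encoded, every leaf gets an entry even if the attribute is missing; a definition level below the maximum is what marks the value as null.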

SLIDE 10

Document databases

How a document database compares to a relational database (RDBMS)

Normalization (nesting and repetition)
ACID compliance

How MongoDB compares attributes

SLIDE 11

MLlib

Main components of MLlib

Transformers, e.g., feature extraction
Estimator, e.g., clustering or regression
Evaluator, e.g., precision and recall calculation
Validator, e.g., k-fold cross validation

Pipeline: Transformation(s) + Estimator
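
The way a pipeline chains transformers into an estimator can be sketched in plain Python. All class names below are hypothetical and only imitate the fit/transform pattern; they are not the MLlib classes, and the "model" is a deliberately trivial threshold on text length.

```python
class LengthExtractor:
    """Transformer: adds a numeric feature to each row (feature extraction)."""

    def transform(self, rows):
        return [dict(row, length=len(row["text"])) for row in rows]


class ThresholdModel:
    """Fitted model: itself a transformer that adds a prediction column."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [dict(row, prediction=int(row["length"] > self.threshold)) for row in rows]


class MeanThresholdEstimator:
    """Estimator: learns a parameter from the data and returns a model."""

    def fit(self, rows):
        lengths = [row["length"] for row in rows]
        return ThresholdModel(sum(lengths) / len(lengths))


class ToyPipelineModel:
    """Result of fitting a pipeline: applies every stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Pipeline: transformer(s) followed by one estimator."""

    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator

    def fit(self, rows):
        for transformer in self.transformers:
            rows = transformer.transform(rows)
        model = self.estimator.fit(rows)
        return ToyPipelineModel(self.transformers + [model])


# Assumed training and test rows for illustration.
train = [{"text": "hi"}, {"text": "hello world"}]
fitted = ToyPipeline([LengthExtractor()], MeanThresholdEstimator()).fit(train)
print(fitted.transform([{"text": "ok"}, {"text": "a long sentence"}]))
```

Fitting the pipeline returns an object that replays the same feature extraction before predicting, so training and prediction cannot drift apart.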

SLIDE 12

Did we cover everything?

SLIDE 13

2019 Big data & AI Landscape

SLIDE 14

Topics not Covered

Key-value stores
Big graph analytics
Visualization
Streaming
Coordination
Cloud platforms

SLIDE 15

Key-value Stores

Provide a simple API to insert/delete/update/search key-value pairs
Records are indexed by key (typically a string)
Internal structure is typically a log-structured merge (LSM) tree
Not generally suitable for large-scale analytics
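
A minimal in-memory sketch of the LSM idea behind such an API follows. ToyLSM, its method names, and the tiny memtable limit are assumptions for illustration; a real store would write the sorted runs to disk and add logging, indexes, and background compaction.

```python
import bisect

_TOMBSTONE = object()  # marker written in place of a deleted value


class ToyLSM:
    """Toy LSM-style key-value store: a memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []  # newest first; each run is (sorted keys, parallel values)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        """Insert or update: always a write to the in-memory table."""
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def delete(self, key):
        """Delete: write a tombstone instead of modifying older runs."""
        self.put(key, _TOMBSTONE)

    def get(self, key):
        """Search the memtable first, then the runs from newest to oldest."""
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is _TOMBSTONE else value
        for keys, values in self.runs:
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return None if values[i] is _TOMBSTONE else values[i]
        return None

    def _flush(self):
        """Turn the memtable into a new sorted run."""
        keys = sorted(self.memtable)
        self.runs.insert(0, (keys, [self.memtable[k] for k in keys]))
        self.memtable = {}

    def compact(self):
        """Merge all runs into one, keeping the newest value and dropping tombstones."""
        merged = {}
        for keys, values in reversed(self.runs):  # oldest first, newer overwrite
            merged.update(zip(keys, values))
        live = sorted(k for k, v in merged.items() if v is not _TOMBSTONE)
        self.runs = [(live, [merged[k] for k in live])] if live else []


db = ToyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # memtable is full, so it is flushed to a run
db.put("a", 10)  # update goes to the new memtable
db.delete("b")   # tombstone; triggers a second flush
print(db.get("a"), db.get("b"), len(db.runs))
db.compact()
print(db.get("a"), len(db.runs))
```

Writes never touch existing runs, which keeps inserts cheap, while a point lookup may have to check several runs; scanning all of them for analytics is where this structure is at a disadvantage.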

SLIDE 16

Big Graph Analytics

Graphs are usually processed using a node-centric processing model
Nodes and edges are both treated as first-class citizens
Processing is normally iterative, often with many iterations
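
A node-centric, iterative computation can be sketched with connected components by label propagation. The function name and the sample edges are assumptions; the loop runs on one machine and only imitates the pattern in which every node updates its own state from its neighbors in each iteration.

```python
def connected_components(edges):
    """Label each node with the smallest node id reachable from it.

    In every iteration each node looks at its neighbors' labels from the
    previous iteration and adopts the smallest one. The loop stops when
    no label changes. Returns (labels, number of iterations).
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    label = {node: node for node in neighbors}  # every node starts as its own component
    iterations = 0
    changed = True
    while changed:
        changed = False
        iterations += 1
        new_label = {}
        for node, adjacent in neighbors.items():
            best = min([label[node]] + [label[n] for n in adjacent])
            new_label[node] = best
            if best != label[node]:
                changed = True
        label = new_label
    return label, iterations


# Assumed sample graph: a chain 1-2-3-4 and a separate edge 5-6.
labels, rounds = connected_components([(1, 2), (2, 3), (3, 4), (5, 6)])
print(labels, rounds)
```

The number of iterations grows with the longest chain in the graph, which is one reason graph processing is described as needing many iterations.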

SLIDE 17

Visualization

Sometimes called Business Intelligence (BI)
Focuses more on the end-user interface while producing nice graphs (e.g., bar charts and line graphs)
Internally, the data is managed using the common big-data platforms, but the systems are tuned to provide fast query response for ad-hoc queries

SLIDE 18

Streaming

Some applications need to process data in real-time with a very small latency
Examples: Twitter search, IoT applications, and social network trends
Works primarily off main memory
Keeps only the latest records to ensure real-time response
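
Keeping only the latest records in memory can be sketched with a sliding time window. The class name, the 10-second window, and the hashtag events are illustrative assumptions, and the sketch expects timestamps to arrive in non-decreasing order.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Counts items seen during the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, item), oldest on the left
        self.counts = Counter()  # running counts for items still in the window

    def add(self, timestamp, item):
        """Record a new event and drop everything that fell out of the window."""
        self.events.append((timestamp, item))
        self.counts[item] += 1
        self._evict(timestamp)

    def _evict(self, now):
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_item = self.events.popleft()
            self.counts[old_item] -= 1
            if self.counts[old_item] == 0:
                del self.counts[old_item]

    def top(self, n=1):
        """Most frequent items in the current window."""
        return self.counts.most_common(n)


# Assumed stream of (timestamp in seconds, hashtag) events.
window = SlidingWindowCounter(window_seconds=10)
for ts, tag in [(0, "#a"), (5, "#b"), (6, "#b"), (12, "#c")]:
    window.add(ts, tag)
print(window.top(1), len(window.events))
```

Memory use is bounded by the number of events in the window rather than by the length of the stream, and each answer is read from the running counts without rescanning old data.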

SLIDE 19

Coordination

Most big-data systems are designed for shared-nothing large-scale analytics
By design, the machines do not coordinate with each other
Coordination systems provide an easy way to coordinate the work in these distributed platforms, e.g., a catalog of information, a work queue, and a global system status

SLIDE 20

Machine Learning

ML is on the rise
The increasing amount of data makes it a big-data problem
Some large-scale ML systems have emerged to provide scalable processing

SLIDE 21

Cloud Platforms

Maintaining your own cluster is costly
It could be underutilized most of the time
Cloud platforms allow you to rent virtual machines to do your work and dispose of them afterward
They are well integrated with big-data platforms (such as Hadoop and Spark) to give the best user experience
All you need is an internet connection and a credit card

SLIDE 22

What is next?

SLIDE 23

What is next?

Real big data is widely available
Big data is like gold
Only a few people know how to deal with it
You're now one of them

Applications

Keep your hands dirty
Consider using the public cloud (e.g., AWS, Google Cloud, or Microsoft Azure)

SLIDE 24

Job Market

https://www.techicy.com/5-best-programming-languages-to-watch-out-in-2019-for-data-science.html

SLIDE 25

Data Science

Credits: Drew Conway

SLIDE 26

Data Science

https://mashimo.wordpress.com/2016/05/28/big-data-data-science-and-machine-learning-explained/

SLIDE 27

Next Steps

CS

Big data tools
Python/R/Scala

Math/Stats

Linear algebra
Correlation analysis
Hypothesis tests

Collaboration with domain experts

Visualization
Prototyping

SLIDE 28

CS

https://www.slideshare.net/galvanizeHQ/how-to-become-a-data-scientist-by-ryan-orban-vp-of-operations-and-expansion-galvanize

SLIDE 29

CS/Big Data

https://www.slideshare.net/galvanizeHQ/how-to-become-a-data-scientist-by-ryan-orban-vp-of-operations-and-expansion-galvanize

SLIDE 30

Math/Stats

https://www.slideshare.net/galvanizeHQ/how-to-become-a-data-scientist-by-ryan-orban-vp-of-operations-and-expansion-galvanize

SLIDE 31

Online Courses

https://www.slideshare.net/galvanizeHQ/how-to-become-a-data-scientist-by-ryan-orban-vp-of-operations-and-expansion-galvanize

SLIDE 32

Data Analytics

https://www.slideshare.net/galvanizeHQ/how-to-become-a-data-scientist-by-ryan-orban-vp-of-operations-and-expansion-galvanize

SLIDE 33

Thank You!

Good Luck :)
