[PPT] - Topics in Data Science Cheng Ren, Lixing Lian Outline PowerPoint Presentation

SLIDE 1

MapReduce Extension

Topics in Data Science

Cheng Ren, Lixing Lian

SLIDE 2

Scientific Workloads

- Hadoop's Adolescence: An Analysis of Hadoop Usage in Scientific Workloads

Iterative extension

- HaLoop: Efficient Iterative Data Processing on Large Clusters

Adaptive Indexes

- Only Aggressive Elephants are Fast Elephants (HAIL)

Outline

SLIDE 3

Hadoop's adolescence: An analysis of Hadoop usage in scientific workloads

SLIDE 4

SLIDE 5

SLIDE 6

SLIDE 7

SLIDE 8

SLIDE 9

SLIDE 10

SLIDE 11

SLIDE 12

SLIDE 13

SLIDE 14

SLIDE 15

SLIDE 16

SLIDE 17

SLIDE 18

SLIDE 19

SLIDE 20

SLIDE 21

SLIDE 22

HaLoop: Efficient Iterative Data Processing on Large Clusters

Yingyi Bu, Bill Howe, Magda Balazinska, Michael D. Ernst

SLIDE 23

Motivation
Examples that cannot be executed efficiently
Architecture
Caching ideas

Outline

SLIDE 24

MapReduce can't express recursion/iteration
Lots of interesting programs need loops

- graph algorithms
- clustering
- machine learning
- recursive queries (CTEs, Datalog, WITH clause)

Dominant solution: Use a driver program outside of MapReduce

Hypothesis: making MapReduce loop-aware affords optimization

- lays a foundation for scalable implementations of recursive languages

Motivation
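To make the "driver program outside of MapReduce" pattern concrete, here is a minimal in-memory sketch. It is not Hadoop code: `run_job`, `driver`, `halve`, and `total` are hypothetical names, and the "job" is a toy that halves each value until nothing changes. The point is only that the loop and the convergence test live in the client, so every iteration is a fresh job.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one toy MapReduce job in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, groups[key]) for key in sorted(groups)]


def driver(state, mapper, reducer, max_iterations=50):
    """Client-side loop: submit one job per iteration until a fixpoint."""
    for iteration in range(1, max_iterations + 1):
        new_state = run_job(state, mapper, reducer)
        if new_state == state:  # convergence is checked outside the job
            return new_state, iteration
        state = new_state
    return state, max_iterations


def halve(record):
    """Toy mapper (assumption for illustration): halve each value."""
    key, value = record
    yield key, value // 2


def total(key, values):
    """Toy reducer: sum the values of one key."""
    return key, sum(values)


result, iterations = driver([("a", 8), ("b", 3)], halve, total)
```

Because the framework never sees the loop, it cannot know which inputs stay the same from one iteration to the next.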

SLIDE 25

Example 1: PageRank

SLIDE 26

PageRank Implementation on MapReduce

SLIDE 27

What's the problem?

L and Count are loop invariants, but

1. They are loaded on each iteration
2. They are shuffled on each iteration
3. Also, the fixpoint is evaluated as a separate MapReduce job per iteration
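The reloading problem can be illustrated with a small single-process sketch. It assumes that L is the link table (page to outgoing links) and Count is the out-degree of each page; the three-page graph, the `load_links` helper, and the `LOADS` counter are made up for the example. The counter stands in for the I/O a real job would spend re-reading the invariant table.

```python
from collections import defaultdict

LOADS = {"L": 0}  # counts how often the invariant link table is read


def load_links():
    """Stand-in for reading the link table L from storage (toy data)."""
    LOADS["L"] += 1
    return {"a": ["b", "c"], "b": ["c"], "c": ["a"]}


def pagerank_naive(pages, iterations=20, damping=0.85):
    """PageRank where every iteration reloads L and recomputes Count."""
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        links = load_links()  # loop invariant, yet read again each time
        count = {src: len(dests) for src, dests in links.items()}
        incoming = defaultdict(float)
        for src, dests in links.items():
            for dest in dests:
                incoming[dest] += rank[src] / count[src]
        rank = {
            page: (1 - damping) / len(pages) + damping * incoming[page]
            for page in pages
        }
    return rank


ranks = pagerank_naive(["a", "b", "c"])
```

After the run, `LOADS["L"]` equals the number of iterations, even though the table never changed.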

SLIDE 28

Example 2: Transitive Closure

SLIDE 29

Transitive Closure on MapReduce

SLIDE 30

What's the problem?

Friend is loop invariant, but

1. Friend is loaded on each iteration
2. Friend is shuffled on each iteration
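As a rough illustration of why Friend is loop invariant, the sketch below computes a transitive closure iteratively in plain Python. The join-then-deduplicate structure is an assumption about how the MapReduce version is organized, and the `friend` data is invented. Only the newly found pairs change between iterations; the Friend relation is joined against unchanged every time.

```python
def transitive_closure(friend):
    """Iterate join + dedup until no new pairs appear."""
    closure = set(friend)
    delta = set(friend)  # pairs discovered in the previous iteration
    iterations = 0
    while delta:
        iterations += 1
        # Join step: extend each new pair by one edge of the invariant relation.
        joined = {(a, c) for (a, b) in delta for (b2, c) in friend if b == b2}
        # Dedup step: keep only pairs not seen in earlier iterations.
        delta = joined - closure
        closure |= delta
    return closure, iterations


friend = {("ann", "bob"), ("bob", "cat"), ("cat", "dan")}
closure, iterations = transitive_closure(friend)
```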

SLIDE 31

Architecture
Cache loop-invariant data
Programming Model

Push loops into MapReduce!

SLIDE 32

HaLoop Architecture

SLIDE 33

Inter-iteration caching

SLIDE 34

RI: Reducer Input Cache

Provides:

- Access to loop-invariant data without map/shuffle

Data for:

- Reducer function

Assumes:

1. Static partitioning (implies: no new nodes)
2. Deterministic mapper implementation

PageRank

- Avoid loading and shuffling the web graph at every iteration

Transitive Closure

- Avoid loading and shuffling the friends graph at every iteration
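A reducer input cache can be pictured as "shuffle the invariant data once, then reuse the partitions". The sketch below is a toy model under that assumption, not HaLoop's implementation: the class, its methods, and the byte-sum partitioner are hypothetical. The partitioner is deliberately deterministic, mirroring the static-partitioning assumption listed above.

```python
from collections import defaultdict


class ReducerInputCache:
    """Toy cache: partition invariant records once, reuse them afterwards."""

    def __init__(self, num_reducers):
        self.num_reducers = num_reducers
        self.partitions = None
        self.shuffles = 0  # how many times the invariant data was shuffled

    def partition_of(self, key):
        # Static, deterministic partitioning (Python's str hash varies per run).
        return sum(key.encode()) % self.num_reducers

    def get(self, invariant_records):
        if self.partitions is None:  # only the first iteration pays the cost
            self.shuffles += 1
            parts = [defaultdict(list) for _ in range(self.num_reducers)]
            for key, value in invariant_records:
                parts[self.partition_of(key)][key].append(value)
            self.partitions = parts
        return self.partitions


cache = ReducerInputCache(num_reducers=2)
links = [("a", "b"), ("a", "c"), ("b", "c")]
for _ in range(3):  # three iterations of some loop body
    parts = cache.get(links)
```

If the partitioning changed between iterations, a reducer's cached partition would no longer match the keys routed to it, which is why the cache depends on that assumption.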

SLIDE 35

RO: Reducer Output Cache

Provides:

- Distributed access to output of previous iterations

Used by:

- Fixpoint evaluation

Assumes:

1. Partitioning constant across iterations
2. Reducer output key functionally determines reducer input key

PageRank

- Allows distributed fixpoint evaluation
- Obviates extra MapReduce job

Transitive Closure

- No help
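One way to read "distributed fixpoint evaluation" is that each reducer compares its new output with its own cached output from the previous iteration, and only the per-partition distances are combined. The sketch below assumes a Manhattan distance and made-up rank values; the function names are illustrative and not taken from HaLoop.

```python
def local_distance(previous, current):
    """Distance between two iterations' outputs on one reducer partition."""
    return sum(abs(current[key] - previous.get(key, 0.0)) for key in current)


def reached_fixpoint(cached_outputs, new_outputs, threshold):
    """Combine the per-partition distances; no extra job re-reads the data."""
    total = sum(
        local_distance(previous, current)
        for previous, current in zip(cached_outputs, new_outputs)
    )
    return total < threshold


# One dict per reducer partition (toy PageRank values).
cached = [{"a": 0.30, "b": 0.20}, {"c": 0.50}]
new = [{"a": 0.31, "b": 0.20}, {"c": 0.49}]
```

This only works when the same keys land on the same reducer in every iteration, which is what the two assumptions above guarantee.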

SLIDE 36

MI: Mapper Input Cache

Provides:

- Access to non-local mapper input on later iterations

Data for:

- Map function

Assumes:

- Mapper input does not change

- Avoids non-local data reads on iterations > 0

SLIDE 37

Mapper/reducer stay the same!
Touch points

- Input/Output: for each <iteration, step>
- Cache filter: which tuple to cache?
- Distance function: optional

Nested job containing child jobs as loop body
Minimize extra programming effort

Programming Model
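The touch points can be sketched as a small driver object. Everything here is hypothetical (`LoopJob` and its fields are not HaLoop's API), and the per-<iteration, step> input/output hook is left out for brevity. It only shows how a cache filter, an optional distance function, and a list of child jobs could fit together as a nested job.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LoopJob:
    """Toy nested job: child jobs form the loop body (hypothetical API)."""

    steps: List[Callable]                  # each step: (state, cache) -> state
    cache_filter: Callable                 # tuple -> bool, what to cache
    distance: Optional[Callable] = None    # optional fixpoint test
    threshold: float = 0.0
    max_iterations: int = 10

    def run(self, invariant_tuples, state):
        # Cache the selected invariant tuples once, before the loop starts.
        cache = [t for t in invariant_tuples if self.cache_filter(t)]
        for iteration in range(1, self.max_iterations + 1):
            previous = state
            for step in self.steps:
                state = step(state, cache)
            if self.distance and self.distance(previous, state) <= self.threshold:
                break
        return state, iteration


job = LoopJob(
    steps=[lambda state, cache: min(state + len(cache), 6)],
    cache_filter=lambda t: t[0] == "edge",
    distance=lambda old, new: abs(new - old),
)
final_state, iterations = job.run([("edge", 1), ("tmp", 2), ("edge", 3)], 0)
```

The step function plays the role of an unchanged mapper/reducer pair; only the loop control around it is new.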

SLIDE 38

Relatively simple changes to MapReduce/Hadoop can support iterative/recursive programs

- TaskTracker (cache management)
- Scheduler (cache awareness)
- Programming model (multi-step loop bodies, cache control)

Optimizations

- Caching reducer input realizes the largest gain
- Good to eliminate extra MapReduce step for termination checks
- Mapper input cache benefit inconclusive; need a busier cluster

Conclusions

SLIDE 39

Only Aggressive Elephants are Fast Elephants

Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jörg Schad

SLIDE 40

Motivation
Comparison between Hadoop and HAIL
Upload pipeline
Query pipeline

Outline

SLIDE 41

Bob wants to analyze a large web log by filtering on conditions such as source IP and web address.

He uses a sequence of different filter conditions, each one triggering a new MapReduce job.

He is not exactly sure what he is looking for.

“Let’s see what I am going to encounter on the way.”

Bob

SLIDE 42

This kind of use case illustrates an exploratory usage of Hadoop MapReduce.

It is a major use case of Hadoop MapReduce.

- One major problem: slow query runtimes.
- Time dominated by the I/O for reading all input data.

Bob

SLIDE 43

HAIL: Hadoop Aggressive Indexing Library

SLIDE 44

HAIL + MapReduce vs. HDFS + MapReduce

SLIDE 45

HDFS + MapReduce

SLIDE 46

HDFS

Horizontal partitions: HDFS blocks of 64 MB (default), stored on datanodes
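The horizontal partitioning is simple arithmetic, sketched below. The 64 MB block size comes from the slide; `split_into_blocks` and the 200 MB file are illustrative assumptions.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size, as on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block of a file of file_size bytes."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
```

A 200 MB file yields three full blocks plus one 8 MB remainder block.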

SLIDE 47

HDFS

SLIDE 48

HDFS

SLIDE 49

HDFS

SLIDE 50

HDFS

SLIDE 51

HDFS

Allows two failovers
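The "two failovers" claim follows from keeping three copies of each block on distinct datanodes. The sketch below checks this by brute force. It assumes a replication factor of 3 and uses a simple round-robin placement, which is a simplification and not HDFS's actual placement policy; the node names are invented.

```python
from itertools import combinations

DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]


def place_replicas(num_blocks, datanodes, replication=3):
    """Round-robin placement on distinct nodes (simplified, not HDFS policy)."""
    return {
        block: [datanodes[(block + i) % len(datanodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def survives(placement, failed):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node not in failed for node in nodes) for nodes in placement.values()
    )


placement = place_replicas(num_blocks=4, datanodes=DATANODES)
two_failures_ok = all(
    survives(placement, set(failed)) for failed in combinations(DATANODES, 2)
)
```

With three replicas, any two node failures leave a copy of every block, while a third failure can lose one.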

SLIDE 52

MapReduce

map(row) -> set of (ikey, value)

SLIDE 53

MapReduce

map(row) -> set of (ikey, value)

SLIDE 54

MapReduce

map(row) -> set of (ikey, value)

SLIDE 55

MapReduce

map(docID, document) -> set of (term, docID)
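The signature `map(docID, document) -> set of (term, docID)` is the map side of an inverted index. A minimal in-memory sketch follows. The reducer, the whitespace tokenizer, and the sample documents are assumptions added for illustration.

```python
from collections import defaultdict


def map_fn(doc_id, document):
    """map(docID, document) -> set of (term, docID)."""
    return {(term, doc_id) for term in document.lower().split()}


def reduce_fn(term, doc_ids):
    """Collect the sorted list of documents containing a term."""
    return term, sorted(set(doc_ids))


def inverted_index(docs):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for doc_id, document in docs.items():
        for term, found_in in map_fn(doc_id, document):
            groups[term].append(found_in)
    return dict(reduce_fn(term, ids) for term, ids in groups.items())


index = inverted_index({1: "fast elephants", 2: "aggressive elephants"})
```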

SLIDE 56

HAIL + MapReduce

SLIDE 57

HAIL

Horizontal partitions: HDFS blocks of 64 MB (default)

SLIDE 58

HAIL

SLIDE 59

HAIL

SLIDE 60

HAIL

SLIDE 61

HAIL

SLIDE 62

HAIL

1. Convert the input file into binary PAX
2. Create a series of different sort orders
3. Create multiple clustered indexes

- If indexes cannot help, fall back to standard Hadoop scanning.

SLIDE 63

HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica.
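The idea of "a different clustered index on each replica" can be sketched with plain lists. Each replica below stores the same rows sorted on a different attribute and keeps a sparse index of the first key of each page. The page size, field names, and data are invented, and real HAIL blocks use a binary PAX layout that this toy ignores.

```python
from bisect import bisect_left, bisect_right


def build_replica(rows, attribute, rows_per_page=2):
    """Sort one replica on `attribute` and index the first key of each page."""
    ordered = sorted(rows, key=lambda row: row[attribute])
    pages = [
        ordered[i:i + rows_per_page] for i in range(0, len(ordered), rows_per_page)
    ]
    index = [page[0][attribute] for page in pages]
    return {"attribute": attribute, "pages": pages, "index": index}


def range_query(replica, low, high):
    """Scan only the pages that can hold keys in [low, high]."""
    attribute = replica["attribute"]
    first = max(bisect_left(replica["index"], low) - 1, 0)
    last = bisect_right(replica["index"], high)
    hits = [
        row
        for page in replica["pages"][first:last]
        for row in page
        if low <= row[attribute] <= high
    ]
    return hits, last - first  # matching rows and number of pages scanned


rows = [
    {"source_ip": "10.0.0.4", "duration": 9},
    {"source_ip": "10.0.0.1", "duration": 30},
    {"source_ip": "10.0.0.6", "duration": 5},
    {"source_ip": "10.0.0.2", "duration": 12},
    {"source_ip": "10.0.0.5", "duration": 7},
    {"source_ip": "10.0.0.3", "duration": 21},
]
# Same logical block, two replicas, two different sort orders.
replicas = [build_replica(rows, "source_ip"), build_replica(rows, "duration")]
hits, pages_scanned = range_query(replicas[1], low=5, high=9)
```

A filter on `duration` touches two of three pages on the second replica, while the first replica would serve a filter on `source_ip` just as cheaply.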

SLIDE 64

HAIL Upload Pipeline

SLIDE 65

Why Clustered Indexes?

- Unclustered indexes are only competitive for very selective queries, as they may trigger considerable random I/O for non-selective index traversals.

- Clustered indexes do not have that problem. Whatever the selectivity, we will read the clustered index and scan the qualifying blocks.

HAIL Upload Pipeline

SLIDE 66

The semantics of an ACK for a packet of a block are changed

- From “packet received, validated, and flushed”
- To “packet received and validated”

HAIL Upload Pipeline

In parallel to forwarding and reassembling packets, each datanode sorts the data, creates indexes, and forms a HAIL block.

SLIDE 67

HAIL Upload Pipeline

Enrich the HDFS namenode to schedule map tasks close to replicas having a suitable index

Dir_rep mapping: (blockID, datanode) -> HAILBlockReplicaInfo
HAILBlockReplicaInfo contains detailed information about the types of available indexes
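The Dir_rep mapping can be modelled as a dictionary keyed by (blockID, datanode). In the sketch below, the name `HAILBlockReplicaInfo` comes from the slide, but its single field, the `pick_datanode` helper, and the sample entries are assumptions. The real structure records more detail about the available indexes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HAILBlockReplicaInfo:
    indexed_attribute: str  # hypothetical field: attribute this replica indexes


# Dir_rep: (blockID, datanode) -> HAILBlockReplicaInfo
dir_rep = {
    (42, "dn1"): HAILBlockReplicaInfo("source_ip"),
    (42, "dn2"): HAILBlockReplicaInfo("visit_date"),
    (42, "dn3"): HAILBlockReplicaInfo("ad_revenue"),
}


def pick_datanode(dir_rep, block_id, filter_attribute):
    """Prefer a replica indexed on the filter attribute, else any replica.

    Returns (datanode, uses_index); (None, False) if the block is unknown.
    """
    candidates = sorted(node for (block, node) in dir_rep if block == block_id)
    for node in candidates:
        if dir_rep[(block_id, node)].indexed_attribute == filter_attribute:
            return node, True
    # No suitable index: fall back to a plain scan on some replica.
    return (candidates[0], False) if candidates else (None, False)
```

A job filtering on `visit_date` would be scheduled near `dn2`, while a filter on an unindexed attribute falls back to a full scan.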

SLIDE 68

HAIL Query Pipeline

SLIDE 69

Upload Time

SLIDE 70

Query Times

Individual Jobs: Weblog

SLIDE 71

Fast Indexing and Fast Querying

SLIDE 72

Questions? Ask Bob.
