Hadoop Map Reduce - PowerPoint PPT Presentation


SLIDE 1

Hadoop Map Reduce


SLIDE 2

MapReduce is 2-in-1:

- A programming paradigm (a kind of functional programming)
- A query execution engine

We focus on the MapReduce execution engine of Hadoop, running through YARN.

SLIDE 3

Logical View of MapReduce

During MapReduce, the input and output are each considered a set of key-value pairs ⟨l, w⟩.

[Diagram] Input: pairs ⟨l1, w1⟩ → Map → Intermediate Data: pairs ⟨l2, w2⟩ → Reduce → Output: pairs ⟨l3, w3⟩

SLIDE 4

Map and Reduce Functions

Map Function

Maps a single input record to a set (possibly empty) of intermediate records.

Map: ⟨l1, w1⟩ → {⟨l2, w2⟩}

Reduce Function

Reduces a set of intermediate records with the same key to a set (possibly empty) of output records.

Reduce: ⟨l2, {w2}⟩ → {⟨l3, w3⟩}
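As a concrete instance of these two functions, here is an illustrative Python sketch of word count in the MapReduce logical model. The function names are hypothetical; in Hadoop the same logic would be written as a Java Mapper and Reducer.

```python
# Word count as an instance of the logical model above.
# map_fn: one input record <line_no, line> -> intermediate <word, 1> pairs
def map_fn(line_no, line):
    return [(word, 1) for word in line.split()]

# reduce_fn: one key with its grouped values <word, {counts}> -> <word, total>
def reduce_fn(word, counts):
    return [(word, sum(counts))]
```

For the record ⟨1, "a b a"⟩, map_fn emits ⟨a, 1⟩, ⟨b, 1⟩, ⟨a, 1⟩; reduce_fn("a", [1, 1]) emits ⟨a, 2⟩.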

SLIDE 5

Functional Programming

MapReduce is functional programming. Both the map and reduce functions are memoryless/stateless:

- They cannot keep an internal state
- They cannot remember previous records
- They cannot be randomized

Why? To allow Hadoop to parallelize the execution:

- Execute them out-of-order
- Rerun failing tasks
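A small sketch of why statelessness matters: a pure map function produces the same multiset of output pairs no matter in which order Hadoop processes (or reprocesses) the records. The names are hypothetical.

```python
# A stateless map function: its output depends only on the current record,
# never on previously seen records or on hidden state.
def map_fn(record):
    return [(word, 1) for word in record.split()]

records = ["a b", "c a"]

# Hadoop may process splits out-of-order and rerun failed tasks;
# statelessness guarantees the combined output is the same multiset.
forward = [pair for r in records for pair in map_fn(r)]
backward = [pair for r in reversed(records) for pair in map_fn(r)]
assert sorted(forward) == sorted(backward)
```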

SLIDE 6

Overview

[Diagram] The developer writes an MR Program; the driver submits it as an MR Job to the master node, which runs it on the slave nodes.

SLIDE 7

Job Execution Overview

[Diagram] Driver: Job submission → Job preparation → Map → Shuffle → Reduce → Cleanup

SLIDE 8

Job Submission

Execution location: Driver node.

A driver machine should have the following:

- Compatible Hadoop binaries
- Cluster configuration files
- Network access to the master node

Collects job information from the user:

- Input and output paths
- Map, reduce, and any other functions
- Any additional user configuration

Packages all of this in a Hadoop Configuration.

SLIDE 9

Hadoop Configuration

Key: String      Value: String
Input            hdfs://user/eldawy/README.txt
Output           hdfs://user/eldawy/wordcount
Mapper           edu.ucr.cs.cs167.eldawy.WordCount
Reducer          …
JAR File         …
User-defined     User-defined

[Diagram] The configuration is serialized over the network to the master node.
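The configuration is just string key-value pairs that survive a serialization round trip. A minimal sketch, using JSON as a stand-in wire format (Hadoop itself serializes its Configuration as XML):

```python
import json

# The slide's configuration modeled as string key -> string value pairs.
conf = {
    "Input": "hdfs://user/eldawy/README.txt",
    "Output": "hdfs://user/eldawy/wordcount",
    "Mapper": "edu.ucr.cs.cs167.eldawy.WordCount",
}

wire = json.dumps(conf)       # serialized to be sent over the network
received = json.loads(wire)   # deserialized on the master node
assert received == conf
```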

SLIDE 10

Job Preparation

Runs on the master node and gets the job ready for parallel execution:

- Collects the JAR file that contains the user-defined functions, e.g., Map and Reduce
- Writes the JAR and configuration to HDFS so they are accessible to the executors
- Looks at the input file(s) to decide how many map tasks are needed
- Makes some sanity checks
- Finally, pushes the BRB (Big Red Button)

SLIDE 11

Job Preparation

[Diagram] The master node writes the Configuration and JAR File to HDFS and calls InputFormat#getSplits() to divide the input into Split1, Split2, …, SplitM, one per mapper (Mapper1 … MapperM). Each FileInputSplit stores a Path, Start, and End.
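A hedged sketch of how fixed-size splits could be computed, mirroring the (Path, Start, End) triple of a FileInputSplit. This is simplified; Hadoop's real FileInputFormat also accounts for block locations and compression.

```python
def get_splits(path, file_length, split_size):
    # Divide [0, file_length) into fixed-size (path, start, end) splits,
    # analogous to the FileInputSplit (Path, Start, End) in the figure.
    splits = []
    start = 0
    while start < file_length:
        end = min(start + split_size, file_length)
        splits.append((path, start, end))
        start = end
    return splits
```

A 250-byte file with 100-byte splits yields three map tasks, the last one covering the 50-byte remainder.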

SLIDE 12

Map Phase

Runs in parallel on worker nodes. The M mappers:

- Read the input
- Apply the map function
- Apply the combine function (if configured)
- Store the map output

There is no guaranteed ordering for processing the input splits.
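The combine step can be sketched as local pre-aggregation of the map output before it is stored and shuffled. The helper name is hypothetical; the logic is the classic word-count combiner.

```python
from collections import defaultdict

def combine(pairs):
    # Pre-aggregate map output locally, so less data crosses the
    # network during the shuffle (the word-count combiner).
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())
```

Four intermediate pairs [("a", 1), ("b", 1), ("a", 1), ("a", 1)] shrink to two: [("a", 3), ("b", 1)].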

SLIDE 13

Map Phase

[Diagram] The master node assigns the input splits IS1, IS2, …, ISM (map tasks) to the worker nodes.

SLIDE 14

Map Task

1. Reads the job configuration and task information (mainly, the InputSplit)
2. Instantiates an object of the Mapper class
3. Instantiates a record reader for the assigned input split
4. Calls Mapper#setup(Context)
5. Reads records one-by-one from the record reader and passes them to the map function
6. The map function writes the output to the context
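The steps above can be sketched in Python. The Context class and run_map_task are stand-ins, not Hadoop's API; they only model the setup-then-loop structure of a map task.

```python
class Context:
    # Stand-in for Hadoop's Context: collects the map output.
    def __init__(self):
        self.output = []

    def write(self, key, value):
        self.output.append((key, value))

def run_map_task(input_split, setup, map_fn):
    ctx = Context()
    setup(ctx)                      # step 4: Mapper#setup(Context)
    for key, value in input_split:  # step 5: read records one-by-one
        map_fn(key, value, ctx)     # step 6: map writes to the context
    return ctx.output
```

For example, running a word-count map_fn over the single record ⟨0, "a b"⟩ collects [("a", 1), ("b", 1)] in the context.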

SLIDE 15

MapContext

- Keeps track of which input split is being read and which records are being processed
- Holds all the job configuration and some additional information about the map task
- Materializes the map output

SLIDE 16

Map Output

What really happens to the map output? It depends on the number of reducers:

- 0 reducers: the map output is written directly to HDFS as the final answer
- 1+ reducers: the map output is passed to the shuffle phase

SLIDE 17

Shuffle Phase

- Executed only in the case of one or more reducers
- Transfers data between the mappers and reducers
- Groups records by their keys to ensure local processing in the reduce phase

SLIDE 18

Shuffle Phase

[Diagram] Every mapper (Map1 … MapM) sends data to every reducer (Reduce1 … ReduceN).

SLIDE 19

Shuffle Phase (Map-side)

[Diagram] Within each mapper Mapi: the input split is passed to map, producing ⟨k, v⟩ pairs; a Partition step splits these pairs by key (kA … kZ) into N partitions, one for each reducer Reduce1 … ReduceN.
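The Partition step typically assigns each pair to a reducer by hashing its key modulo the number of reducers, which mirrors Hadoop's default HashPartitioner. This sketch uses CRC32 so the assignment is deterministic across runs.

```python
import zlib

def partition(key, num_reducers):
    # Assign a pair to one of N reducers by hashing its key; a deterministic
    # hash (CRC32 here) sends every occurrence of a key to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```

Because the same key always maps to the same reducer, all values for a key meet on one machine in the reduce phase.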

SLIDE 20

Shuffle Phase (Reduce-side)

[Diagram] Each reducer Reducej copies its partition (part1 … partM) from every mapper Map1 … MapM, sorts the copied ⟨k, v⟩ pairs by key, and passes them to reduce.
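Assuming each copied partition arrives already sorted by key (as the map-side sort produces), the reduce-side Sort step is a k-way merge. A minimal sketch using Python's heapq:

```python
import heapq

def copy_and_sort(copied_partitions):
    # Merge the sorted runs copied from each mapper into a single run
    # sorted by key (a k-way merge, as in the Sort step).
    return list(heapq.merge(*copied_partitions))
```

Merging [("a", 1), ("c", 1)] from one mapper with [("a", 2), ("b", 1)] from another yields one sorted run with both "a" records adjacent, ready for grouping.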

SLIDE 21

Reduce Phase

Apply the reduce function to each group of records that share the same key.

[Diagram] The sorted intermediate pairs are grouped by key (k1, k2, k3, …, kN); reduce is called once per group to produce the output.
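Because the pairs arrive sorted, records with the same key are adjacent, so grouping is a single pass. A sketch with itertools.groupby (the names are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def reduce_phase(sorted_pairs, reduce_fn):
    # Call reduce once per group of records sharing the same key.
    output = []
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        output.extend(reduce_fn(key, [value for _, value in group]))
    return output
```

With a summing reduce_fn, the sorted pairs [("a", 1), ("a", 2), ("b", 5)] become [("a", 3), ("b", 5)].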
SLIDE 22

Output Writing

Materializes the final output to disk. All results from one process (mapper/reducer) are stored in a subdirectory.

An OutputFormat is used to:

- Create any files in the output directory
- Write the output records one-by-one to the output
- Merge the results from all the tasks (if needed)

While the output writing runs in parallel, the final commit step runs on a single machine.
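Each task writes its own file in the output directory, which is what lets output writing run in parallel. The helper below is hypothetical, though the part-r-NNNNN naming matches Hadoop's convention for reducer output files.

```python
def part_file_name(task_id):
    # One output file per reduce task; Hadoop names these
    # part-r-00000, part-r-00001, ...
    return "part-r-%05d" % task_id
```

Since every task targets a distinct file, no coordination is needed until the single-machine commit step at the end.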

SLIDE 23

Advanced Issues

- Map failures
- Reduce failures
- Straggler problem
- Custom keys and values
- Efficient sorting on serialized data
- Pipelining MapReduce jobs