Hadoop Map Reduce - PowerPoint PPT Presentation


SLIDE 1

Hadoop Map Reduce


SLIDE 2

MapReduce is 2-in-1:

- A programming paradigm (a kind of functional programming)
- A query execution engine

We focus on the MapReduce execution engine of Hadoop, running through YARN.

SLIDE 3

Logical View of MapReduce

During MapReduce, the input and output are each considered a set of key-value pairs ⟨l, w⟩.

[Diagram] Input: pairs ⟨l1, w1⟩ → Map → Intermediate Data: pairs ⟨l2, w2⟩ → Reduce → Output: pairs ⟨l3, w3⟩

SLIDE 4

Map and Reduce Functions

Map Function

Maps a single input record to a set (possibly empty) of intermediate records.

Map: ⟨l1, w1⟩ → {⟨l2, w2⟩}

Reduce Function

Reduces a set of intermediate records with the same key to a set (possibly empty) of output records.

Reduce: ⟨l2, {w2}⟩ → {⟨l3, w3⟩}
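As a concrete instance of these two functions, here is an illustrative Python sketch of word count in the MapReduce logical model. The function names are hypothetical; in Hadoop the same logic would be written as a Java Mapper and Reducer.

```python
# Word count as an instance of the logical model above.
# map_fn: one input record <line_no, line> -> intermediate <word, 1> pairs
def map_fn(line_no, line):
    return [(word, 1) for word in line.split()]

# reduce_fn: one key with its grouped values <word, {counts}> -> <word, total>
def reduce_fn(word, counts):
    return [(word, sum(counts))]
```

For the record ⟨1, "a b a"⟩, map_fn emits ⟨a, 1⟩, ⟨b, 1⟩, ⟨a, 1⟩; reduce_fn("a", [1, 1]) emits ⟨a, 2⟩.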

SLIDE 5

Functional Programming

MapReduce is functional programming. Both the map and reduce functions are memoryless/stateless:

- They cannot keep an internal state
- They cannot remember previous records
- They cannot be randomized

Why? To allow Hadoop to parallelize the execution:

- Execute them out-of-order
- Rerun failing tasks
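A small sketch of why statelessness matters: a pure map function produces the same multiset of output pairs no matter in which order Hadoop processes (or reprocesses) the records. The names are hypothetical.

```python
# A stateless map function: its output depends only on the current record,
# never on previously seen records or on hidden state.
def map_fn(record):
    return [(word, 1) for word in record.split()]

records = ["a b", "c a"]

# Hadoop may process splits out-of-order and rerun failed tasks;
# statelessness guarantees the combined output is the same multiset.
forward = [pair for r in records for pair in map_fn(r)]
backward = [pair for r in reversed(records) for pair in map_fn(r)]
assert sorted(forward) == sorted(backward)
```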

SLIDE 6

Overview

[Diagram] The developer writes an MR Program; the driver submits it as an MR Job to the master node, which runs it on the slave nodes.

SLIDE 7

Job Execution Overview

[Diagram] Driver: Job submission → Job preparation → Map → Shuffle → Reduce → Cleanup

SLIDE 8

Job Submission

Execution location: Driver node.

A driver machine should have the following:

- Compatible Hadoop binaries
- Cluster configuration files
- Network access to the master node

Collects job information from the user:

- Input and output paths
- Map, reduce, and any other functions
- Any additional user configuration

Packages all of this in a Hadoop Configuration.

SLIDE 9

Hadoop Configuration

Key: String      Value: String
Input            hdfs://user/eldawy/README.txt
Output           hdfs://user/eldawy/wordcount
Mapper           edu.ucr.cs.cs167.eldawy.WordCount
Reducer          …
JAR File         …
User-defined     User-defined

[Diagram] The configuration is serialized over the network to the master node.
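The configuration is just string key-value pairs that survive a serialization round trip. A minimal sketch, using JSON as a stand-in wire format (Hadoop itself serializes its Configuration as XML):

```python
import json

# The slide's configuration modeled as string key -> string value pairs.
conf = {
    "Input": "hdfs://user/eldawy/README.txt",
    "Output": "hdfs://user/eldawy/wordcount",
    "Mapper": "edu.ucr.cs.cs167.eldawy.WordCount",
}

wire = json.dumps(conf)       # serialized to be sent over the network
received = json.loads(wire)   # deserialized on the master node
assert received == conf
```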

SLIDE 10

Job Preparation

Runs on the master node and gets the job ready for parallel execution:

- Collects the JAR file that contains the user-defined functions, e.g., Map and Reduce
- Writes the JAR and configuration to HDFS so they are accessible to the executors
- Looks at the input file(s) to decide how many map tasks are needed
- Makes some sanity checks
- Finally, pushes the BRB (Big Red Button)

SLIDE 11

Job Preparation

[Diagram] The master node writes the Configuration and JAR File to HDFS and calls InputFormat#getSplits() to divide the input into Split1, Split2, …, SplitM, one per mapper (Mapper1 … MapperM). Each FileInputSplit stores a Path, Start, and End.
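A hedged sketch of how fixed-size splits could be computed, mirroring the (Path, Start, End) triple of a FileInputSplit. This is simplified; Hadoop's real FileInputFormat also accounts for block locations and compression.

```python
def get_splits(path, file_length, split_size):
    # Divide [0, file_length) into fixed-size (path, start, end) splits,
    # analogous to the FileInputSplit (Path, Start, End) in the figure.
    splits = []
    start = 0
    while start < file_length:
        end = min(start + split_size, file_length)
        splits.append((path, start, end))
        start = end
    return splits
```

A 250-byte file with 100-byte splits yields three map tasks, the last one covering the 50-byte remainder.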

SLIDE 12

Map Phase

Runs in parallel on worker nodes. The M mappers:

- Read the input
- Apply the map function
- Apply the combine function (if configured)
- Store the map output

There is no guaranteed ordering for processing the input splits.
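The combine step can be sketched as local pre-aggregation of the map output before it is stored and shuffled. The helper name is hypothetical; the logic is the classic word-count combiner.

```python
from collections import defaultdict

def combine(pairs):
    # Pre-aggregate map output locally, so less data crosses the
    # network during the shuffle (the word-count combiner).
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())
```

Four intermediate pairs [("a", 1), ("b", 1), ("a", 1), ("a", 1)] shrink to two: [("a", 3), ("b", 1)].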

SLIDE 13

Map Phase

[Diagram] The master node assigns the input splits IS1, IS2, …, ISM (map tasks) to the worker nodes.

SLIDE 14

Map Task

1. Reads the job configuration and task information (mainly, the InputSplit)
2. Instantiates an object of the Mapper class
3. Instantiates a record reader for the assigned input split
4. Calls Mapper#setup(Context)
5. Reads records one-by-one from the record reader and passes them to the map function
6. The map function writes the output to the context
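The steps above can be sketched in Python. The Context class and run_map_task are stand-ins, not Hadoop's API; they only model the setup-then-loop structure of a map task.

```python
class Context:
    # Stand-in for Hadoop's Context: collects the map output.
    def __init__(self):
        self.output = []

    def write(self, key, value):
        self.output.append((key, value))

def run_map_task(input_split, setup, map_fn):
    ctx = Context()
    setup(ctx)                      # step 4: Mapper#setup(Context)
    for key, value in input_split:  # step 5: read records one-by-one
        map_fn(key, value, ctx)     # step 6: map writes to the context
    return ctx.output
```

For example, running a word-count map_fn over the single record ⟨0, "a b"⟩ collects [("a", 1), ("b", 1)] in the context.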

SLIDE 15

MapContext

- Keeps track of which input split is being read and which records are being processed
- Holds all the job configuration and some additional information about the map task
- Materializes the map output

SLIDE 16

Map Output

What really happens to the map output? It depends on the number of reducers:

- 0 reducers: the map output is written directly to HDFS as the final answer
- 1+ reducers: the map output is passed to the shuffle phase

SLIDE 17

Shuffle Phase

- Executed only in the case of one or more reducers
- Transfers data between the mappers and reducers
- Groups records by their keys to ensure local processing in the reduce phase

SLIDE 18

Shuffle Phase

[Diagram] Every mapper (Map1 … MapM) sends data to every reducer (Reduce1 … ReduceN).

SLIDE 19

Shuffle Phase (Map-side)

[Diagram] Within each mapper Mapi: the input split is passed to map, producing ⟨k, v⟩ pairs; a Partition step splits these pairs by key (kA … kZ) into N partitions, one for each reducer Reduce1 … ReduceN.
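The Partition step typically assigns each pair to a reducer by hashing its key modulo the number of reducers, which mirrors Hadoop's default HashPartitioner. This sketch uses CRC32 so the assignment is deterministic across runs.

```python
import zlib

def partition(key, num_reducers):
    # Assign a pair to one of N reducers by hashing its key; a deterministic
    # hash (CRC32 here) sends every occurrence of a key to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```

Because the same key always maps to the same reducer, all values for a key meet on one machine in the reduce phase.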

SLIDE 20

Shuffle Phase (Reduce-side)

[Diagram] Each reducer Reducej copies its partition (part1 … partM) from every mapper Map1 … MapM, sorts the copied ⟨k, v⟩ pairs by key, and passes them to reduce.
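Assuming each copied partition arrives already sorted by key (as the map-side sort produces), the reduce-side Sort step is a k-way merge. A minimal sketch using Python's heapq:

```python
import heapq

def copy_and_sort(copied_partitions):
    # Merge the sorted runs copied from each mapper into a single run
    # sorted by key (a k-way merge, as in the Sort step).
    return list(heapq.merge(*copied_partitions))
```

Merging [("a", 1), ("c", 1)] from one mapper with [("a", 2), ("b", 1)] from another yields one sorted run with both "a" records adjacent, ready for grouping.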

SLIDE 21

Reduce Phase

Apply the reduce function to each group of records that share the same key.

[Diagram] The sorted intermediate pairs are grouped by key (k1, k2, k3, …, kN); reduce is called once per group to produce the output.
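Because the pairs arrive sorted, records with the same key are adjacent, so grouping is a single pass. A sketch with itertools.groupby (the names are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def reduce_phase(sorted_pairs, reduce_fn):
    # Call reduce once per group of records sharing the same key.
    output = []
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        output.extend(reduce_fn(key, [value for _, value in group]))
    return output
```

With a summing reduce_fn, the sorted pairs [("a", 1), ("a", 2), ("b", 5)] become [("a", 3), ("b", 5)].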
SLIDE 22

Output Writing

Materializes the final output to disk. All results from one process (mapper/reducer) are stored in a subdirectory.

An OutputFormat is used to:

- Create any files in the output directory
- Write the output records one-by-one to the output
- Merge the results from all the tasks (if needed)

While the output writing runs in parallel, the final commit step runs on a single machine.
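Each task writes its own file in the output directory, which is what lets output writing run in parallel. The helper below is hypothetical, though the part-r-NNNNN naming matches Hadoop's convention for reducer output files.

```python
def part_file_name(task_id):
    # One output file per reduce task; Hadoop names these
    # part-r-00000, part-r-00001, ...
    return "part-r-%05d" % task_id
```

Since every task targets a distinct file, no coordination is needed until the single-machine commit step at the end.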

SLIDE 23

Advanced Issues

- Map failures
- Reduce failures
- Straggler problem
- Custom keys and values
- Efficient sorting on serialized data
- Pipelining MapReduce jobs