Declarative MapReduce (presentation slides, 10/29/2018)

SLIDE 1

Declarative MapReduce

10/29/2018 1

SLIDE 2

MapReduce Examples

Filter: Map
Aggregate: Map, then Reduce
Grouped aggregate: Map, then Reduce
Equi-join: Map, then Reduce
Non-equi-join: Map, then Reduce

SLIDE 3

Declarative Languages

Describe what you want to do, not how to do it.
The most popular example is SQL.
Can we compile SQL queries into MapReduce program(s)?

SLIDE 4

Pig

A system built on top of Hadoop (now supports Spark as well).
Provides a SQL-like, ETL-oriented query language termed Pig Latin.
Compiles Pig Latin programs into MapReduce programs.

SLIDE 5

Examples

Filter: Return all the lines that have a user-specified response code, e.g., 200.

log = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes);
ok_lines = FILTER log BY response == '200';
STORE ok_lines INTO 'filtered_output';

Job: Map only
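To make the map-only shape concrete, the filter can be simulated in plain Python. This is a minimal sketch, not what Pig actually generates: the comma-separated sample lines, the `FIELDS` list, and the helper names `filter_map` and `run_map_only` are all assumptions made for illustration.

```python
import csv

# Field names taken from the AS clause of the LOAD statement.
FIELDS = ["host", "time", "method", "url", "response", "bytes"]

def filter_map(line, wanted_response="200"):
    """Map function: emit the line unchanged if its response code matches."""
    record = dict(zip(FIELDS, next(csv.reader([line]))))
    if record["response"] == wanted_response:
        yield line

def run_map_only(lines, map_fn):
    """Simulate a map-only job: no shuffle and no reduce phase."""
    for line in lines:
        yield from map_fn(line)

# Hypothetical sample input (assumed comma-separated).
sample = [
    "h1,1000,GET,/a,200,512",
    "h2,2000,GET,/b,404,0",
    "h1,3000,POST,/a,200,128",
]
ok_lines = list(run_map_only(sample, filter_map))
```

Each line is inspected on its own, so no data has to move between machines, which is why a filter needs no reduce phase.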

SLIDE 6

Examples

Grouped aggregate: Find the total number of bytes per response code.

log = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
grouped = GROUP log BY response;
grouped_aggregate = FOREACH grouped GENERATE group, SUM(log.bytes);
STORE grouped_aggregate INTO 'grouped_output';

Job: Map, then Reduce
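The grouped sum corresponds to one map function and one reduce function. The Python sketch below simulates that shape in memory; the tiny `run_map_reduce` driver, the function names, and the sample records are hypothetical, and a real Hadoop job would shuffle across machines rather than use a dictionary.

```python
from collections import defaultdict

def sum_map(record):
    """Map: key each request by its response code, with its byte count as the value."""
    yield record["response"], record["bytes"]

def sum_reduce(response, byte_counts):
    """Reduce: add up all byte counts that share one response code."""
    yield response, sum(byte_counts)

def run_map_reduce(records, map_fn, reduce_fn):
    """Simulate map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    results = []
    for key in sorted(groups):
        results.extend(reduce_fn(key, groups[key]))
    return results

# Hypothetical sample records.
sample = [
    {"response": "200", "bytes": 512},
    {"response": "404", "bytes": 0},
    {"response": "200", "bytes": 128},
]
totals = run_map_reduce(sample, sum_map, sum_reduce)
```

The grouping key of the GROUP statement becomes the shuffle key, and the aggregate function becomes the body of the reducer.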

SLIDE 7

Examples

Grouped aggregate: Find the average number of bytes per response code.

log = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
grouped = GROUP log BY response;
grouped_aggregate = FOREACH grouped GENERATE group, AVG(log.bytes);
STORE grouped_aggregate INTO 'grouped_output';
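An average is slightly trickier than a sum, because averages of partial results cannot simply be averaged again. One common approach, sketched here in Python under the assumption that partial results are carried as (sum, count) pairs, keeps the computation safe to combine in stages. The function names and sample records are hypothetical.

```python
from collections import defaultdict

def avg_map(record):
    """Map: emit (response, (bytes, 1)) so partial results stay combinable."""
    yield record["response"], (record["bytes"], 1)

def avg_combine(partials):
    """Merge (sum, count) pairs; safe to apply repeatedly, e.g. in a combiner."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count

def avg_reduce(response, partials):
    """Reduce: merge all partials for a key and divide once at the end."""
    total, count = avg_combine(partials)
    return response, total / count

def average_bytes(records):
    """Simulate the whole job in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in avg_map(record):
            groups[key].append(value)
    return dict(avg_reduce(key, values) for key, values in groups.items())

# Hypothetical sample records.
sample = [
    {"response": "200", "bytes": 512},
    {"response": "404", "bytes": 0},
    {"response": "200", "bytes": 128},
]
averages = average_bytes(sample)
```

Dividing only in the final step is what makes the intermediate merge order irrelevant.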

SLIDE 8

Examples

Join: Find pairs of requests that ask for the same URL and come from the same source.

log1 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
log2 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
joined = JOIN log1 BY (url, host), log2 BY (url, host);
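An equi-join can be expressed as a reduce-side join: the map phase keys every record by the join columns and tags it with its input, and the reduce phase pairs up the two sides within each key. The Python sketch below simulates this; it is one standard technique, not necessarily the exact plan Pig produces, and the function names and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import product

def join_map(record, tag):
    """Map: key by the join columns (url, host) and tag the record with its input."""
    yield (record["url"], record["host"]), (tag, record)

def join_reduce(key, tagged):
    """Reduce: within one key, pair every log1 record with every log2 record."""
    left = [rec for tag, rec in tagged if tag == "log1"]
    right = [rec for tag, rec in tagged if tag == "log2"]
    for left_rec, right_rec in product(left, right):
        yield left_rec, right_rec

def equi_join(log1, log2):
    """Simulate the shuffle and reduce phases in memory."""
    groups = defaultdict(list)
    for tag, records in (("log1", log1), ("log2", log2)):
        for record in records:
            for key, value in join_map(record, tag):
                groups[key].append(value)
    joined = []
    for key, tagged in groups.items():
        joined.extend(join_reduce(key, tagged))
    return joined

# Hypothetical sample log, joined with itself as in the example above.
log = [
    {"host": "h1", "url": "/a", "time": 1000},
    {"host": "h1", "url": "/a", "time": 3000},
    {"host": "h2", "url": "/b", "time": 2000},
]
joined = equi_join(log, log)
```

Because the log is joined with itself, every request is also paired with itself, so the result contains more than just pairs of distinct requests.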

SLIDE 9

Examples

Join: Find pairs of requests that ask for the same URL, come from the same source, and happened within an hour of each other.

log1 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
log2 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
joined = JOIN log1 BY (url, host), log2 BY (url, host);
-- 3600000 ms = 1 hour, assuming time is stored in milliseconds
filtered = FILTER joined BY ABS(log1::time - log2::time) < 3600000;
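The time condition is an inequality, so it cannot serve as a shuffle key; only the equality part (url, host) can. The inequality is instead checked on each joined pair. A minimal Python sketch of that per-pair check follows, assuming timestamps are integers in milliseconds; the constant name, the predicate, and the sample pairs are hypothetical.

```python
ONE_HOUR_MS = 3_600_000  # assumes time is measured in milliseconds

def within_one_hour(pair):
    """Predicate applied to each already-joined pair of requests."""
    left, right = pair
    return abs(left["time"] - right["time"]) < ONE_HOUR_MS

# Hypothetical joined pairs that already agree on url and host.
joined = [
    ({"time": 0}, {"time": 1_000}),           # 1 second apart: kept
    ({"time": 0}, {"time": 10_000_000}),      # almost 3 hours apart: dropped
    ({"time": 5_000_000}, {"time": 5_000_000}),  # same instant: kept
]
filtered = [pair for pair in joined if within_one_hour(pair)]
```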

SLIDE 10

How it works

LOAD operation: Determines the input path and InputFormat.
STORE operation: Determines the output path and OutputFormat.
FILTER and FOREACH: Translated into map-only jobs.
GROUP (aggregation) and JOIN: Translated into map-reduce jobs.

The script as a whole is compiled into one or more MapReduce jobs.

SLIDE 11

Additional Features

Lazy execution: Nothing is actually executed until the STORE command is reached.
Consolidation of map-only jobs: Map-only jobs (FILTER and FOREACH) can be consolidated into the next job’s map function or the previous job’s reduce function.
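Consolidation works because a map-only step is just a function from one record to zero or more records, and such functions compose. The Python sketch below chains a FILTER-like step and a FOREACH-like step into a single function that could run inside another job's map or reduce phase. The `fuse` helper and the two step functions are hypothetical names for illustration, not part of Pig.

```python
def fuse(*steps):
    """Chain map-only steps (record -> zero or more records) into one function."""
    def fused(record):
        current = [record]
        for step in steps:
            current = [out for rec in current for out in step(rec)]
        yield from current
    return fused

def only_ok(record):
    """FILTER-like step: keep requests whose response code is 200."""
    if record["response"] == "200":
        yield record

def project_host_bytes(record):
    """FOREACH-like step: keep only the host and bytes fields."""
    yield (record["host"], record["bytes"])

# One function now does the work of two map-only jobs.
fused_map = fuse(only_ok, project_host_bytes)
kept = list(fused_map({"host": "h1", "response": "200", "bytes": 5}))
dropped = list(fused_map({"host": "h2", "response": "404", "bytes": 7}))
```

Running the fused function in one pass avoids writing the intermediate result to storage and reading it back between jobs.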

SLIDE 12

A Complex Example

log1 = LOAD 'logs.csv' USING PigStorage() AS (…);
log2 = LOAD 'logs.csv' USING PigStorage() AS (…);
joined = JOIN log1 BY (url, host), log2 BY (url, host);
filtered = FILTER joined BY ABS(log1::time - log2::time) < 3600000;
grouped = GROUP filtered BY log1::host;
agg_groups = FOREACH grouped GENERATE group, COUNT(filtered);
STORE agg_groups INTO 'final_result';
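One plausible way to lay this script out, assuming the consolidation rule described earlier, is two map-reduce jobs: the first performs the join with the FILTER folded into its reduce step, and the second groups by host and counts. The Python sketch below simulates that layout in memory. The split into two jobs, the function names, and the sample data are assumptions for illustration, not a plan confirmed by Pig.

```python
from collections import defaultdict
from itertools import product

ONE_HOUR_MS = 3_600_000  # assumes time is measured in milliseconds

def job1_join_and_filter(log):
    """Job 1: shuffle on (url, host); the FILTER runs inside the reduce step."""
    groups = defaultdict(list)
    for record in log:  # map: key each request by (url, host)
        groups[(record["url"], record["host"])].append(record)
    for records in groups.values():  # reduce: pair up requests in each group
        for left, right in product(records, repeat=2):
            if abs(left["time"] - right["time"]) < ONE_HOUR_MS:
                yield left, right

def job2_count_per_host(pairs):
    """Job 2: map emits (host, 1) per pair; reduce sums the ones per host."""
    counts = defaultdict(int)
    for left, _ in pairs:
        counts[left["host"]] += 1
    return dict(counts)

# Hypothetical sample log; the same log is used for both sides of the join.
log = [
    {"host": "h1", "url": "/a", "time": 0},
    {"host": "h1", "url": "/a", "time": 1_000},
    {"host": "h1", "url": "/a", "time": 10_000_000},
    {"host": "h2", "url": "/b", "time": 0},
]
result = job2_count_per_host(job1_join_and_filter(log))
```

The counts include each request paired with itself, because the self-join does not exclude identical rows.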

SLIDE 13

Further Readings

Pig home page: https://pig.apache.org
Detailed documentation: http://pig.apache.org/docs/r0.17.0/
The original Pig Latin paper: Olston, Christopher, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. "Pig Latin: a not-so-foreign language for data processing." In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1099-1110. ACM, 2008.
