[PPT] - Using Pig, Hive, and Impala with Hadoop Jay Urbain, PowerPoint Presentation

SLIDE 1

Using Pig, Hive, and Impala with Hadoop

Jay Urbain, PhD

SLIDE 2

Velocity

We are generating data faster than ever

– Processes are increasingly automated
– People are increasingly interacting online
– Systems are increasingly interconnected

SLIDE 3

Variety

We are producing a wide variety of data

– Social network connections
– Images, audio, and video
– Server and application log files
– Product ratings on shopping and review Web sites
– And much more…

Not all of this maps cleanly to the relational model

SLIDE 4

Volume

Every day…

– More than 1.5 billion shares are traded on the New York Stock Exchange
– Facebook stores 2.7 billion comments and ‘Likes’
– Google processes about 24 petabytes of data

Every minute…

– Foursquare handles more than 2,000 check-ins
– TransUnion makes nearly 70,000 updates to credit files

And every second…

– Banks process more than 10,000 credit card transactions

SLIDE 5

Data Has Value

This data has many valuable applications

– Product recommendations
– Predicting demand
– Marketing analysis
– Fraud detection
– And many, many more…

We must process it to extract that value

– And processing all the data can yield more accurate results

SLIDE 6

We Need a System that Scales

We’re generating too much data to process with traditional tools

Two key problems to address

– How can we reliably store large amounts of data at a reasonable cost?
– How can we analyze all the data we have stored?

SLIDE 7

Apache Hadoop

Scalable and economical data storage and processing

– Distributed and fault-tolerant
– Harnesses the power of industry-standard hardware

Heavily inspired by technical documents published by Google

‘Core’ Hadoop consists of two main components

– Storage: the Hadoop Distributed File System (HDFS)
– Processing: MapReduce
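To make the processing component more concrete, here is a minimal single-process sketch of the MapReduce model in Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for illustration, and a real cluster would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Calling `word_count` on a list of text lines returns a dictionary of word frequencies; the map and reduce functions never see the whole data set, only one line or one key at a time.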

SLIDE 8

Apache Pig

Apache Pig builds on Hadoop to offer high-level data processing

– This is an alternative to writing low-level MapReduce code
– Pig is especially good at joining and transforming data

people = LOAD '/user/training/customers' AS (cust_id, name);
orders = LOAD '/user/training/orders' AS (ord_id, cust_id, cost);
groups = GROUP orders BY cust_id;
totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
result = JOIN totals BY group, people BY cust_id;
DUMP result;
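For readers who do not know Pig Latin, the same data flow can be sketched in plain Python. This is only an illustration of what the script computes, not how Pig executes it; the sample rows and the helper names `totals_by_customer` and `join_totals` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the two HDFS inputs.
people = [(1, "Alice"), (2, "Bob")]                       # (cust_id, name)
orders = [(100, 1, 25.0), (101, 2, 10.0), (102, 1, 5.0)]  # (ord_id, cust_id, cost)

def totals_by_customer(order_rows):
    """GROUP orders BY cust_id, then SUM(orders.cost) per group."""
    totals = defaultdict(float)
    for _ord_id, cust_id, cost in order_rows:
        totals[cust_id] += cost
    return dict(totals)

def join_totals(totals, people_rows):
    """JOIN totals BY group, people BY cust_id (inner join)."""
    return [(cust_id, totals[cust_id], cust_id, name)
            for cust_id, name in people_rows if cust_id in totals]

result = join_totals(totals_by_customer(orders), people)
```

Each output tuple carries the fields of both joined relations, which is why the customer ID appears twice.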

SLIDE 9

Use Case: ETL Processing

Pig is also widely used for Extract, Transform, and Load (ETL) processing

[Diagram: data from Operations, Accounting, and Call Center passes through Pig jobs running on a Hadoop cluster (validate data, fix errors, remove duplicates, encode values) and is loaded into the Data Warehouse]
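The four ETL steps named on this slide can be sketched as a single Python function. The record layout `(cust_id, name, state)`, the `state_codes` lookup, and the specific cleaning rules are all assumptions made for this example; a real Pig job would express the same steps as relational operators over data in HDFS.

```python
def clean_records(records, state_codes):
    """Validate, fix, de-duplicate, and encode raw (cust_id, name, state) rows."""
    seen = set()
    cleaned = []
    for cust_id, name, state in records:
        if cust_id is None:              # validate data: drop rows with no key
            continue
        name = name.strip().title()      # fix errors: normalize whitespace and case
        if cust_id in seen:              # remove duplicates: keep first occurrence
            continue
        seen.add(cust_id)
        code = state_codes.get(state.strip().lower())  # encode values
        cleaned.append((cust_id, name, code))
    return cleaned
```

States missing from the lookup are encoded as `None` here; whether to reject or keep such rows is a design choice the slide does not specify.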

SLIDE 10

Apache Hive

Hive is another abstraction on top of MapReduce

– Like Pig, it also reduces development time
– Hive uses a SQL-like language called HiveQL

SELECT customers.cust_id, SUM(cost) AS total
FROM customers
JOIN orders ON customers.cust_id = orders.cust_id
GROUP BY customers.cust_id
ORDER BY total DESC;
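Because the query above is close to standard SQL, its result can be previewed without a cluster. The sketch below runs it with Python's built-in `sqlite3` module as a stand-in for Hive; SQLite is not Hive, and the table definitions and sample rows are assumptions for illustration.

```python
import sqlite3

QUERY = """
SELECT customers.cust_id, SUM(cost) AS total
FROM customers JOIN orders ON customers.cust_id = orders.cust_id
GROUP BY customers.cust_id
ORDER BY total DESC
"""

def run_demo():
    """Load hypothetical rows into an in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (cust_id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE orders (ord_id INTEGER, cust_id INTEGER, cost REAL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(1, "Alice"), (2, "Bob")])
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(100, 1, 25.0), (101, 2, 40.0), (102, 1, 5.0)])
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows
```

The returned rows are ordered by total, largest first, one row per customer who has at least one order.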

SLIDE 11

Use Case: Log File Analytics

Server log files are an important source of data

Hive allows you to treat a directory of log files like a table

– Allows SQL-like queries against raw data

Dualcore Inc. Public Web Site (June 1 - 8)

Product    Unique Visitors  Page Views  Bounce Rate  Conversion Rate  Average Time on Page
Tablet     5,278            5,894       23%          65%              17 seconds
Notebook   4,139            4,375       47%          31%              23 seconds
Stereo     2,873            2,981       61%          12%              42 seconds
Monitor    1,749            1,862       74%          19%              26 seconds
Router     987              1,139       56%          17%              37 seconds
Server     314              504         48%          28%              53 seconds
Printer    86               97          27%          64%              34 seconds
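As a rough illustration of how two of the report's columns could be derived from raw logs, the Python sketch below counts page views and unique visitors per product. The three-field log format (`timestamp visitor_id path`) and the function name `summarize` are hypothetical; the slides do not show the actual log layout or the Hive query used.

```python
from collections import defaultdict

def summarize(log_lines):
    """Return {product: (unique_visitors, page_views)} from raw log lines."""
    views = defaultdict(int)
    visitors = defaultdict(set)
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3:              # skip malformed lines
            continue
        _timestamp, visitor_id, path = parts
        product = path.rsplit("/", 1)[-1]   # last path segment names the product
        views[product] += 1
        visitors[product].add(visitor_id)
    return {p: (len(visitors[p]), views[p]) for p in views}
```

In Hive the same aggregation would be a `GROUP BY` over a table defined on the log directory, with no need to copy or convert the files first.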

SLIDE 12

Apache Sqoop

Sqoop exchanges data between a database and Hadoop

It can import all tables, a single table, or a portion of a table into HDFS

– Result is a directory in HDFS containing comma-delimited text files

Sqoop can also export data from HDFS back to the database

[Diagram: Sqoop moves data in both directions between a database and a Hadoop cluster]
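The import result described above, a directory of comma-delimited text files, is easy to consume from other tools. The Python sketch below reads such a directory as one table. It uses a local temporary directory in place of HDFS, and the `part-m-*` file names and sample rows are assumptions for the example.

```python
import csv
import tempfile
from pathlib import Path

def read_import_dir(directory):
    """Read every part file in a directory of comma-delimited text as one table."""
    rows = []
    for part in sorted(Path(directory).glob("part-*")):
        with part.open(newline="") as handle:
            rows.extend(csv.reader(handle))
    return rows

def demo():
    """Write two hypothetical part files locally, then read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "part-m-00000").write_text("1,Alice\n2,Bob\n")
        Path(tmp, "part-m-00001").write_text("3,Carol\n")
        return read_import_dir(tmp)
```

All values come back as strings, because delimited text carries no type information; a schema has to be applied by whatever reads the files.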

SLIDE 13

Cloudera Impala

Massively parallel SQL engine which runs on a Hadoop cluster

– Inspired by Google’s Dremel project
– Can query data stored in HDFS or HBase tables

High performance

– Typically at least 10 times faster than Pig, Hive, or MapReduce
– High-level query language (subset of SQL)

Impala is 100% Apache-licensed open source

SLIDE 14

Where Impala Fits Into the Data Center

[Diagram: transaction records from an application database, log data from web servers, and documents from a file server feed a Hadoop cluster with Impala; analysts query it through the Impala shell for ad hoc queries or through a BI tool]

SLIDE 15

Recap of Data Analysis/Processing Tools

MapReduce

– Low-level processing and analysis

Pig

– Procedural data flow language executed using MapReduce

Hive

– SQL-based queries executed using MapReduce

Impala

– High-performance SQL-based queries using a custom execution engine

SLIDE 16

Comparing Pig, Hive, and Impala

Description of Feature               Pig    Hive   Impala
SQL-based query language             No     Yes    Yes
User-defined functions (UDFs)        Yes    Yes    No
Process data with external scripts   Yes    Yes    No
Extensible file format support       Yes    Yes    No
Complex data types                   Yes    Yes    No
Query latency                        High   High   Low
Built-in data partitioning           No     Yes    Yes
Accessible via ODBC / JDBC           No     Yes    Yes

SLIDE 17

What kinds of NoSQL

NoSQL solutions fall into two major areas:

– Key/Value or ‘the big hash table’.

Amazon S3 (Dynamo)
Voldemort
Scalaris
Memcached (in-memory key/value store)
Redis

– Schema-less, which comes in multiple flavors: column-based, document-based, or graph-based.

Cassandra (column-based)
CouchDB (document-based)
MongoDB (document-based)
Neo4J (graph-based)
HBase (column-based)

SLIDE 18

Key/Value

Pros:

– very fast
– very scalable
– simple model
– able to distribute horizontally

Cons:

– many data structures (objects) can't be easily modeled as key/value pairs
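The "simple model" and "able to distribute horizontally" points are linked: because every operation names exactly one key, a hash of that key is enough to pick a node. Here is a minimal Python sketch under that assumption. The class `PartitionedStore` is hypothetical, uses in-memory dictionaries as nodes, and ignores rebalancing when the node count changes.

```python
import hashlib

class PartitionedStore:
    """A key/value store whose keys are spread across several in-memory nodes."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        # A stable hash, so the same key always maps to the same node.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)
```

The sketch also shows the limitation listed under Cons: there is no way to ask for "all orders of customer 1" unless that exact grouping was chosen as a key in advance.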

SLIDE 19

Schema-Less

Pros:

– Schema-less data model is richer than key/value pairs
– eventual consistency
– many are distributed
– still provide excellent performance and scalability

Cons:

– typically no ACID transactions or joins

SLIDE 20

Common Advantages

Cheap, easy to implement (open source)
Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned

– Down nodes easily replaced
– No single point of failure

Easy to distribute
Don't require a schema
Can scale up and down
Relax the data consistency requirement (CAP)
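The replication point can be illustrated with a small Python sketch: every write goes to all live nodes, and a read succeeds as long as any node is up. The class `ReplicatedStore` is hypothetical and deliberately simplistic; it replicates to every node, does no partitioning, and does not resynchronize a node that comes back, which is where the relaxed consistency in real systems comes from.

```python
class ReplicatedStore:
    """Writes every value to all live nodes; reads from the first node that is up."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]
        self.down = set()   # indexes of nodes currently unavailable

    def put(self, key, value):
        for index, node in enumerate(self.nodes):
            if index not in self.down:
                node[key] = value

    def get(self, key):
        for index, node in enumerate(self.nodes):
            if index not in self.down:
                return node.get(key)
        raise RuntimeError("all replicas are down")
```

Marking a node as down (for example `store.down.add(0)`) does not affect reads of data written earlier, because the remaining replicas hold identical copies.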

SLIDE 21

What am I giving up?

joins
group by
order by
ACID transactions
SQL as a sometimes frustrating but still powerful query language
easy integration with other applications that support SQL

SLIDE 22

Bigtable and HBase

SLIDE 23

Data Model

A table in Bigtable is a sparse, distributed, persistent multidimensional sorted map

Map indexed by a row key, column key, and a timestamp

– (row:string, column:string, time:int64) -> uninterpreted byte array

Supports lookups, inserts, deletes

– Single row transactions only

Image Source: Chang et al., OSDI 2006
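The map signature above can be mimicked with an ordinary Python dictionary keyed by a `(row, column, timestamp)` tuple. This is a single-process sketch, not Bigtable's or HBase's API; the class `SparseTable` and its method names are hypothetical, and it leaves out the "distributed", "persistent", and "sorted" properties.

```python
class SparseTable:
    """Maps (row, column, timestamp) to an uninterpreted byte string."""

    def __init__(self):
        self.cells = {}   # only cells that were written take up space (sparse)

    def put(self, row, column, timestamp, value):
        self.cells[(row, column, timestamp)] = bytes(value)

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column
                    and (timestamp is None or ts <= timestamp)]
        return max(versions, key=lambda pair: pair[0])[1] if versions else None

    def delete(self, row, column, timestamp):
        self.cells.pop((row, column, timestamp), None)
```

Because the timestamp is part of the key, writing a new value does not overwrite the old one; older versions stay readable until they are deleted.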

SLIDE 24

Rows and Columns

Rows maintained in sorted lexicographic order

– Applications can exploit this property for efficient row scans
– Row ranges dynamically partitioned into tablets

Columns grouped into column families

– Column key = family:qualifier
– Column families provide locality hints
– Unbounded number of columns
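Two of the properties above lend themselves to a short Python sketch: a range scan over lexicographically sorted row keys, done with binary search, and splitting a column key into family and qualifier. The function names and the sample keys are assumptions for illustration.

```python
import bisect

def range_scan(sorted_rows, start, stop):
    """Return row keys in [start, stop) from a lexicographically sorted list."""
    lo = bisect.bisect_left(sorted_rows, start)
    hi = bisect.bisect_left(sorted_rows, stop)
    return sorted_rows[lo:hi]

def split_column_key(column_key):
    """Split 'family:qualifier' into its two parts."""
    family, _, qualifier = column_key.partition(":")
    return family, qualifier
```

Sorted order is what makes row-key design matter: keys that share a prefix sit next to each other, so a scan over one prefix touches a contiguous range rather than the whole table.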

SLIDE 25

HBase is ..

A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage.

Designed to operate on top of the Hadoop distributed file system (HDFS) or Kosmos File System (KFS, aka Cloudstore) for scalability, fault tolerance, and high availability.

SLIDE 26

Benefits

Distributed storage
Table-like in data structure

– multi-dimensional sorted map

High scalability
High availability
High performance

SLIDE 27

HBase Is Not …

Tables have one primary index, the row key.
No join operators.
Scans and queries can select a subset of available columns, perhaps by using a wildcard.

There are three types of lookups:

– Fast lookup using row key and optional timestamp.
– Full table scan
– Range scan from region start to end.

SLIDE 28

HBase Is Not … (2)

Limited atomicity and transaction support.

– HBase supports multiple batched mutations of single rows only.
– Data is unstructured and untyped.

Not accessed or manipulated via SQL.

– Programmatic access via Java, REST, or Thrift APIs.
– Scripting via JRuby.

SLIDE 29

Why Bigtable?

RDBMS performance is good for transaction processing, but for very large-scale analytic processing the solutions are commercial, expensive, and specialized.

Very large-scale analytic processing

– Big queries – typically range or table scans.
– Big databases (100s of TB)

SLIDE 30

Why HBase?

HBase is a Bigtable clone.
It is open source
It has a good community and promise for the future
It is developed on top of and has good integration for the Hadoop platform, if you are using Hadoop already.

SLIDE 31

HBase benefits over RDBMS

No real indexes
Automatic partitioning
Scale linearly and automatically with new nodes
Commodity hardware
Fault tolerance
Batch processing
