
SLIDE 1

Scientific data management and analysis – Data acquired from the Large Synoptic Survey Telescope (LSST)

Amin Mesmoudi

SLIDE 2

Context

  • CNRS-Mastodons 2012-
  • LSST Project

SLIDE 3

LSST needs in storage and data access (1/2)

  • Storage:

Table                     Size    #Records  #Attributes
Object                    109 TB  38 B      470
Moving Object             5 GB    6 M       100
Source                    3.6 PB  5 T       125
Forced Source             1.1 PB  32 T      7
Difference Image Source   71 TB   200 B     65
CCD Exposure              0.6 TB  17 B      45

(1 T (trillion) = 10¹²)

SLIDE 4

LSST needs in storage and data access (2/2)

  • Access:
  • Declarative queries (SQL)
  • With the possibility to define new ad hoc functions (UDFs)
  • Example: areaSpec_box, angDist < dist
  • 500,000 queries per day

SELECT objectId, taiMidPoint, fluxToAbMag(psfMag)
FROM Source
JOIN Object USING (objectId)
JOIN Filter USING (filterId)
WHERE areaSpec_box(:raMin, :declMin, :raMax, :declMax)
  AND filterName = 'u'
  AND variability BETWEEN :varMin AND :varMax
ORDER BY objectId, taiMidPoint ASC
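As an illustration only (not the project's actual implementation, which lives in the LSST query layer), the logic of the two UDFs used above can be sketched in Python; the −48.6 zero point assumes fluxes in erg/s/cm²/Hz (the standard AB-magnitude convention):

import math

def area_spec_box(ra, decl, ra_min, decl_min, ra_max, decl_max):
    # True when the position (ra, decl) falls inside the rectangular sky region.
    return ra_min <= ra <= ra_max and decl_min <= decl <= decl_max

def flux_to_ab_mag(flux):
    # Standard AB-magnitude conversion; assumes flux in erg/s/cm^2/Hz.
    return -2.5 * math.log10(flux) - 48.6

print(area_spec_box(359.9595, 2.01, 359.959, 2.0, 359.96, 2.05))  # True
print(round(flux_to_ab_mag(3.631e-20), 2))                        # ~0.0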

SLIDE 5

Project Objectives

  • Propose a distributed architecture able to store 100+ PB of data
  • Open source
  • Shared-nothing
  • Have the possibility to evaluate both simple queries (a few seconds of computation) and complex queries (days of computation)
  • Ability to access objects using indexes or by a full scan on large tables (>> 1 PB)
  • Study the capacity of existing systems to meet the needs of the LSST project
  • Benchmark

SLIDE 6

Outline

  • Context
  • MapReduce
  • SQL on MapReduce
  • Benchmark
  • Data sets
  • Queries
  • Experiments
  • Summary

SLIDE 7

MapReduce (1/2)

  • “MapReduce is a simplified parallel data processing approach for execution on a computer cluster” [Dean and Ghemawat 2004].
  • The programmer specifies only two functions:
  • map: processes a <key, value> pair as input and produces a set of <key2, value> pairs as intermediate results
  • reduce: combines the intermediate values sharing a unique key and produces a set of values as output

map(in_key, in_value) -> list(out_key, intermediate_value)
reduce(out_key, list(intermediate_value)) -> list(out_value)
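A minimal sketch of such a pair of functions (Python, illustrative names), for the running example of the next two slides — counting the Source records that refer to each object:

def map_fn(in_key, source_row):
    # in_key (e.g., a byte offset) is ignored; the value is one Source row
    # of the form (sourceid, objectid, ra, decl, exposureid).
    yield (source_row[1], 1)            # emit one <objectid, 1> pair

def reduce_fn(out_key, intermediate_values):
    # out_key is one distinct objectid; sum its partial counts.
    yield sum(intermediate_values)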

SLIDE 8

MapReduce (2/2)

Running example — count, for each object, the number of Source records that refer to it.

Input (Source table):

SourceID  ObjectID  RA  DECL  ExposureID
1         1         10   3    1
5         2         15   4    1
9         2         20   5    5
13        4         25   7    1
2         1         40   7    2
6         2         45   8    2
10        3         50   9    2
14        4         55  13    3
3         1         60  10    3
7         2         65  11    3
11        3         70  12    3
15        3         75  15    4
4         1         80  12    4
8         2         85  13    4
12        3         90  14    5
16        4         95  20    6

Output:

ObjectID  freq
1         4
2         5
3         4
4         3

SLIDE 9

The same Source records, divided into four input splits (one per node):

Node 01 input split:
SourceID  ObjectID  RA  DECL  ExposureID
1         1         10   3    1
5         2         15   4    1
9         2         20   5    5
13        4         25   7    1

Node 02 input split:
SourceID  ObjectID  RA  DECL  ExposureID
2         1         40   7    2
6         2         45   8    2
10        3         50   9    2
14        4         55  13    3

Node 03 input split:
SourceID  ObjectID  RA  DECL  ExposureID
3         1         60  10    3
7         2         65  11    3
11        3         70  12    3
15        3         75  15    4

Node 04 input split:
SourceID  ObjectID  RA  DECL  ExposureID
4         1         80  12    4
8         2         85  13    4
12        3         90  14    5
16        4         95  20    6

[Figure: on each node a Mapper emits one <ObjectID, 1> pair per record, a Combiner pre-aggregates the pairs locally, and a Partitioner routes each key through the shuffle-and-sort phase to a Reducer; the Reducers sum the partial counts into the final result ObjectID/freq = (1, 4), (2, 5), (3, 4), (4, 3).]
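To make the pictured phases concrete, here is a toy single-process simulation (Python; data layout and the modulo partitioner are illustrative) of the mapper / combiner / partitioner / reducer pipeline on the four splits above:

from collections import defaultdict

splits = [
    [(1, 1), (5, 2), (9, 2), (13, 4)],      # node 1: (sourceid, objectid)
    [(2, 1), (6, 2), (10, 3), (14, 4)],     # node 2
    [(3, 1), (7, 2), (11, 3), (15, 3)],     # node 3
    [(4, 1), (8, 2), (12, 3), (16, 4)],     # node 4
]
num_reducers = 4

reducer_input = defaultdict(list)
for split in splits:
    local = defaultdict(int)                # combiner: pre-aggregate locally
    for _, objectid in split:
        local[objectid] += 1                # mapper emits (objectid, 1)
    for objectid, partial in local.items():
        # partitioner: route each key to one reducer
        reducer_input[objectid % num_reducers].append((objectid, partial))

for r in range(num_reducers):               # reducers sum the partial counts
    totals = defaultdict(int)
    for objectid, partial in reducer_input[r]:
        totals[objectid] += partial
    for objectid, freq in sorted(totals.items()):
        print(objectid, freq)               # final counts: 1->4, 2->5, 3->4, 4->3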

SLIDE 10

SQL-On-MapReduce

[Architecture: a query enters a query-processing layer, which produces MapReduce jobs run by a jobs executor; a data loader places the data in the DFS, which is spread over Workers 1–5.]

SLIDE 11

SQL-On-MapReduce

[Same architecture as the previous slide, annotated with the systems compared below.]

SQL completeness (L, M, H suggest the range of support):

SQL-on-Hadoop technology  Operations SQL  DDL/DML  Packaged analytic functions  UDFs/Custom functions
Hive                      M               M        L                            H
HadoopDB/Hadapt           M               M        L                            L
Drill                     M               L        –                            L
Impala                    M               L        –                            L
Presto                    M               –        –                            L
Spark/Shark               M               M        –                            H
SLIDE 12

Benchmark – Why?

http://spark.apache.org/
http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/

SLIDE 13

Benchmark – What?

  • A set of resources and methods (situations) that allow comparing the efficiency of systems and approaches offering the same functionality
  • Resources: data sets, queries, parameters, machines, …
  • Methods: indexed vs. non-indexed data, smart vs. simple partitioning, selective vs. non-selective queries, increasing the number of machines, compressed data, …

SLIDE 14

Data sets

DataSet   Size
PT 1.1    90 GB
PT 1.2    250 GB
Winter13  3 TB
SDSS      3 TB

SLIDE 15

Data sets

Tables of PT 1.2 (250 GB):

Table                          #attributes  #records  Size
Source                         107          162 (m)   118 GB
Object                         227          4 (m)     7 GB
RefSrcMatch                    10           189 (m)   1.7 GB
Science_Ccd_Exposure_Metadata  6            41 (m)    16 GB

SLIDE 16

Data sets

Scaled datasets obtained from PT 1.2 (Source and Object tables):

DataSet        Source: #records  Source: Size (GB)  Object: #records  Object: Size (GB)
PT 1.2 250 GB  325 (m)           236                9 (m)             14
PT 1.2 500 GB  650 (m)           472                18 (m)            28
PT 1.2 1 TB    1.3 (b)           944                36 (m)            56
PT 1.2 2 TB    2.6 (b)           1888               72 (m)            112

Number of distinct values per attribute:

DataSet  SourceID  ObjectID  DECL     RA       scienceCcdExposureId
D250GB   325 (m)   9 (m)     325 (m)  162 (m)  84 (k)
D500GB   650 (m)   18 (m)    650 (m)  162 (m)  84 (k)
D1TB     1.3 (b)   36 (m)    1.1 (b)  162 (m)  84 (k)
D2TB     2.6 (b)   72 (m)    2.3 (b)  162 (m)  84 (k)

SLIDE 17

Queries (1/2)

Selection:
  Q1: select * from source where sourceid=29785473054213321;
  Q2: select sourceid, ra, decl from source where objectid=402386896042823;
  Q3: select sourceid, objectid from source where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2;
  Q4: select sourceid, ra, decl from source where scienceccdexposureid=454490250461;

Group By:
  Q5: select objectid, count(sourceid) from source where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2 group by objectid;
  Q6: select objectid, count(sourceid) from source group by objectid;

Join:
  Q7: select * from source join object on (source.objectid=object.objectid) where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2;
  Q8: select * from source join object on (source.objectid=object.objectid) where ra > 359.959 and ra < 359.96;
  Q9: SELECT s.psfFlux, s.psfFluxSigma, sce.exposureType FROM Source s JOIN RefSrcMatch rsm ON (s.sourceId = rsm.sourceId) JOIN Science_Ccd_Exposure_Metadata sce ON (s.scienceCcdExposureId = sce.scienceCcdExposureId) WHERE s.ra > 359.959 AND s.ra < 359.96 AND s.decl < 2.05 AND s.decl > 2 AND s.filterId = 2 AND rsm.refObjectId IS NOT NULL;

SLIDE 18

Queries (2/2)

Order by:
  Q10: select objectid, sourceid from source where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2 order by objectid;
  Q11: select objectid, sourceid from source where ra > 359.959 and ra < 359.96 order by objectid;

UDF:
  Q12: select id from rundeepforcedsource where areaspec_box(coord_ra, coord_decl, -55, -2, 55, 2)=1;
  Q13: select fluxToAbMag(flux_naive) from rundeepforcedsource where objectid=1398583353936135;

SLIDE 19

Predicates selectivity

DataSet  SourceID = id  ObjectID = id  2 < DECL < 2.05  359.959 < RA < 359.96  both RA and DECL ranges  scienceCcdExposureId = id
D250GB   1              43             1.6 (m)          14 (k)                 21                       3.6 (k)
D500GB   1              43             3.3 (m)          28 (k)                 43                       7.3 (k)
D1TB     1              43             6.6 (m)          57 (k)                 86                       14.6 (k)
D2TB     1              43             13.2 (m)         127 (k)                172                      29.2 (k)

SLIDE 20

Experiments’ environment

  • Hardware:
  • 10 DELL C6220 machines
  • 2 clusters of 25 and 50 virtual machines (8 GB RAM, 2 cores, and 300 GB of disk space each)
  • Data sets (PT 1.2): 250 GB, 500 GB, 1 TB, and 2 TB
  • Systems: Hive and HadoopDB

Available resources
RAM          1 TB
Disk space   52 TB
#Processors  240

SLIDE 21

Hive and HadoopDB

[Figure: two architectures side by side. Hive runs MapReduce jobs directly over HDFS. HadoopDB runs a modified Hive (HIVE*) over MapReduce, with a local RDBMS on each node alongside HDFS. A query is compiled into a chain of jobs (Job 1 → Job 2 → … → Job 6) whose final step stores the result of the whole job.]
SLIDE 22

Parameters to be measured

  • Tuning: data loading time (indexing, partitioning, ...)
  • Performance: total query execution time
  • Fault tolerance: number of faults handled by the tool
  • Latency: time needed to get the first response (see the measurement sketch below)
  • Situations:
  • Scalability (data volume): 250 GB → 500 GB → 1 TB → 2 TB
  • Hardware evolution: 25 machines → 50 machines
  • Indexed vs. non-indexed data
  • Different partitioning schemes
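A minimal sketch of how total execution time and first-response latency can be measured for one query; sqlite3 stands in for the real engines (Hive, HadoopDB), and the two-column schema is a reduced, illustrative version of the benchmark's Source table:

import time
import sqlite3

def run_query(conn, sql):
    t0 = time.perf_counter()
    cursor = conn.execute(sql)
    first = cursor.fetchone()               # latency: time to first row
    t_first = time.perf_counter() - t0
    cursor.fetchall()                       # drain the remaining rows
    t_total = time.perf_counter() - t0      # performance: total time
    return t_first, t_total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (sourceid INTEGER, objectid INTEGER)")
conn.executemany("INSERT INTO source VALUES (?, ?)",
                 [(i, i % 4) for i in range(10000)])
print(run_query(conn, "SELECT objectid, COUNT(sourceid) "
                      "FROM source GROUP BY objectid"))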

SLIDE 23

Lessons learned (1/8)

  • Hive vs. HadoopDB
  • HadoopDB offers custom partitioning (see the sketch after the chart below)
  • More data => more time to change the data partitioning scheme (scalability)
  • More hardware (machines) => less time to partition the data (hardware speed-up)

[Chart: tuning time in minutes — Hive (HDFS), global hash, and local hash — for 250 GB, 500 GB, and 1 TB datasets on 25 and 50 machines.]
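A minimal sketch of the hash-partitioning step being timed above, assuming hash(key) mod #nodes placement (the deck does not specify the exact hash function): partitioning on objectid sends every Source record of one object to the same machine.

def global_hash_partition(records, key_index, num_nodes):
    # Assign each record to a node by hashing the partitioning attribute.
    nodes = [[] for _ in range(num_nodes)]
    for rec in records:
        nodes[hash(rec[key_index]) % num_nodes].append(rec)
    return nodes

# Example: partition (sourceid, objectid) pairs on objectid over 4 nodes.
records = [(i, i % 4) for i in range(1, 17)]
for n, chunk in enumerate(global_hash_partition(records, 1, 4)):
    print("node", n, chunk)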

SLIDE 24

Lessons learned (2/8)

  • 1st strategy (Hive): global and distributed index (a conceptual sketch follows the charts below)
  • Hive: index size (GB) vs. index creation time (seconds)
  • More distinct values for an attribute => more space to store the index (e.g., decl)
  • We can take advantage of the hardware to accelerate index creation (50 vs. 25 machines)
  • Creation time: 8 min – 2 h on 25 machines and 5 min – 1 h 30 min on 50 machines

[Charts: index creation time (seconds) and index size (GB) for the sourceid, objectid, ra, decl, and scienceccdexposure indexes, on 250 GB, 500 GB, 1 TB, and 2 TB datasets with 25 and 50 machines.]
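As a rough intuition only (not Hive's actual index format), a global index of this kind maps each distinct attribute value to the data blocks holding matching rows, so a selective query reads only those blocks; this also mirrors why high-cardinality attributes such as decl yield large indexes. A minimal sketch with illustrative names:

from collections import defaultdict

def build_index(blocks, key_index):
    # For each distinct value, record the set of blocks containing it.
    index = defaultdict(set)
    for block_id, rows in enumerate(blocks):
        for row in rows:
            index[row[key_index]].add(block_id)
    return index

blocks = [[(1, 1), (5, 2)], [(9, 2), (13, 4)], [(2, 1), (6, 2)]]
idx = build_index(blocks, 1)        # index on objectid
print(sorted(idx[2]))               # blocks to scan for objectid = 2 -> [0, 1, 2]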

SLIDE 25

Lessons learned (3/8)

  • 2nd strategy (HadoopDB): each node manages the indexes of the data it hosts
  • Index creation time vs. total data loading time
  • Index creation can be parallelized, which allows us to take advantage of the available resources
  • We can take advantage of the available hardware to accelerate creation time (50 vs. 25 machines)
  • Creation time: ~10 minutes for 5 indexes

[Chart: data loading vs. indexing time (minutes) for 250 GB, 500 GB, and 1 TB datasets on 25 and 50 machines.]

SLIDE 26

Lessons learned (4/8)

  • Performance of selection tasks (seconds)
  • 50 machines (400 GB of RAM)
  • Non-indexed data (max 30 min) vs. indexed data (max 3 min)
  • Without index: HadoopDB (+) vs. Hive (-), except for the 250 GB dataset
  • With index: Hive (+) vs. HadoopDB (-), except for the Q3 queries on the 1 TB dataset, where Hive failed to use the index because the size of the filtered part is greater than the available RAM (> 8 GB)
  • Scaling:
  • Without index: 250 GB (max 4 min) vs. 500 GB (max 10 min) vs. 1 TB (max 30 min)
  • With index: 250 GB (max 2 min) vs. 500 GB (max 3 min) vs. 1 TB (max 3.5 min)

[Charts: execution times of Q1–Q4 on Hive and HadoopDB for 250 GB, 500 GB, and 1 TB datasets, without an index (seconds) and with an index.]

SLIDE 27

Lessons learned (5/8)

  • Performance of group-by tasks
  • Non-indexed data (max 1 h 8 min) vs. indexed data (max 3 min)
  • Without index: HadoopDB (+) vs. Hive (-) for selective queries (e.g., Q5), and HadoopDB (-) vs. Hive (+) for non-selective queries (e.g., Q6)
  • With index: Hive systematically refused to use the index for non-selective queries; HadoopDB (+) vs. Hive (-) for selective queries, and HadoopDB does not use the index for non-selective queries
  • Scalability:
  • Without index: 250 GB (max 20 min) vs. 500 GB (max 40 min) vs. 1 TB (max 1 h 10 min)
  • With index: 250 GB (max 15 min) vs. 500 GB (max 35 min) vs. 1 TB (max 1 h 8 min)
  • Hive parallelizes group-by processing => Hive without an index is significantly better than HadoopDB with an index

[Chart: group-by execution times (Q5, Q6) on Hive and HadoopDB, with and without indexes, for 250 GB, 500 GB, and 1 TB datasets.]

SLIDE 28
Lessons learned (6/8)

  • Optimization with HadoopDB: change the partitioning attribute
  • Minimizes network cost (number of records transferred from one machine to another); see the sketch below
  • A gain of up to 500% in execution time

[Chart: HadoopDB with custom partitioning — execution times of Q5 and Q6, with and without indexes, when partitioning on SourceID vs. ObjectID, for 250 GB, 500 GB, and 1 TB datasets.]
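A back-of-the-envelope sketch of the network-cost argument (hash placement and the node count are assumptions, not the deck's measured figures): a GROUP BY on objectid ships almost every intermediate record when the table is partitioned on sourceid, and none when it is partitioned on objectid.

def records_shipped(records, partition_key_index, group_key_index, num_nodes):
    shipped = 0
    for rec in records:
        home = hash(rec[partition_key_index]) % num_nodes   # where it is stored
        dest = hash(rec[group_key_index]) % num_nodes       # where it is reduced
        shipped += home != dest         # transfer only when the nodes differ
    return shipped

records = [(i, i % 4) for i in range(100000)]   # (sourceid, objectid)
print(records_shipped(records, 0, 1, 50))       # partitioned on sourceid: most records move
print(records_shipped(records, 1, 1, 50))       # partitioned on objectid: 0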

SLIDE 29
Lessons learned (7/8)

  • Performance of join tasks (seconds)
  • HadoopDB does not support joins over more than 2 tables
  • Hive does not use indexes to evaluate queries with joins
  • Indexed data (>> 2 h) vs. non-indexed data (>>> 2 h)
  • Hive is significantly better than HadoopDB, even when HadoopDB uses indexes
  • Hive parallelizes the processing of joins (see the sketch below)
  • HadoopDB collects the data and performs the join on a single machine
  • HadoopDB does not use the index for non-selective queries

[Chart: join execution times (Q7, Q8, Q9) in seconds for Hive, HadoopDB, and HadoopDB with index, on 250 GB, 500 GB, and 1 TB datasets.]
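A sketch of the reduce-side join pattern that lets a join be spread across reducers (row contents and names are illustrative; Hive's actual physical plan may differ): rows are tagged with their table of origin, grouped on the join key, and joined locally per key.

from collections import defaultdict

source = [(1, 1, 359.9595), (2, 1, 359.9601), (3, 2, 359.9590)]  # (sourceid, objectid, ra)
objects = [(1, "star"), (2, "galaxy")]                           # (objectid, type)

by_key = defaultdict(lambda: ([], []))
for row in source:
    by_key[row[1]][0].append(row)       # tag 0: Source side of the join
for row in objects:
    by_key[row[0]][1].append(row)       # tag 1: Object side of the join

for key, (src_rows, obj_rows) in sorted(by_key.items()):
    for s in src_rows:                  # local cross product per join key
        for o in obj_rows:
            print(s + o)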

SLIDE 30
Lessons learned (8/8)

  • Performance of ORDER BY tasks (seconds)
  • 50 machines (400 GB of RAM)
  • Hive does not allow using an index to evaluate ORDER BY queries
  • Non-indexed data (max 30 min) vs. indexed data (max 1 min)
  • Hive vs. HadoopDB:
  • HadoopDB is significantly better than Hive, with and without indexes
  • ORDER BY is not a parallelizable task within the MapReduce model (see the sketch below)

[Chart: ORDER BY execution times (Q10, Q11) for Hive and HadoopDB, with and without indexes, on 250 GB, 500 GB, and 1 TB datasets.]
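A sketch of why a total order is hard to parallelize: each reducer emits a sorted run, so a parallel ORDER BY needs range partitioning on the sort key for the concatenation of runs to be globally sorted; otherwise everything funnels through one reducer. The boundaries below are hard-coded for illustration.

boundaries = [4, 8, 12]                 # split the key space into 4 disjoint ranges

def range_partition(key):
    for r, bound in enumerate(boundaries):
        if key < bound:
            return r
    return len(boundaries)

records = [9, 1, 14, 3, 7, 12, 5, 2]
parts = [[] for _ in range(len(boundaries) + 1)]
for k in records:
    parts[range_partition(k)].append(k)

ordered = [k for part in parts for k in sorted(part)]  # concat of sorted runs
print(ordered)                                         # globally sorted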

SLIDE 31

Summary

  • Cost(q) = Σ_{k=1..n−1} cost(job_k) + cost(job_n) − cost_Wr(job_n), where cost_Wr(job_n) is the cost of writing the final job's results (a worked instance follows the list)
  • Cost(job) = cost(Map) + cost(Network) + cost(Reduce) + cost(write results)
  • Optimizations:
  • n: number of jobs
  • Cost(Map): indexing, compression, reading data using many "Mappers"
  • Cost(Network): minimizing the number of <key, value> pairs, intermediate results, ...
  • Partitioning
  • Cost(Reduce): parallel processing (#Reducers), minimizing the size of intermediate results
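A worked instance of the cost model with made-up per-job numbers, only to show how the terms combine for a query compiled into n = 3 MapReduce jobs:

def job_cost(map_c, network_c, reduce_c, write_c):
    # Cost(job) = cost(Map) + cost(Network) + cost(Reduce) + cost(write results)
    return map_c + network_c + reduce_c + write_c

jobs = [job_cost(10, 5, 8, 4), job_cost(6, 3, 5, 2), job_cost(7, 2, 4, 3)]
final_write = 3                       # cost_Wr(job_n): write term of the last job

cost_q = sum(jobs[:-1]) + jobs[-1] - final_write
print(cost_q)                         # 27 + 16 + (16 - 3) = 56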

  • http://com.isima.fr/Petasky/hive-vs-hadoopdb
  • http://petasky.univ-lyon1.fr/scale/
