
SLIDE 1

Scientific data management and analysis – Data acquired from the Large Synoptic Survey Telescope (LSST)

Amin Mesmoudi

SLIDE 2

Context

  • CNRS-Mastodons 2012-
  • LSST Project

SLIDE 3

LSST needs in storage and data access (1/2)

  • Storage:

Table                     Size    #Records  #Attributes
Object                    109 TB  38 B      470
Moving Object             5 GB    6 M       100
Source                    3.6 PB  5 T       125
Forced Source             1.1 PB  32 T      7
Difference Image Source   71 TB   200 B     65
CCD Exposure              0.6 TB  17 B      45

(1 T (trillion) = 10¹²)

SLIDE 4

LSST needs in storage and data access (2/2)

  • Access:
  • Declarative queries (SQL)
  • With the possibility to define new ad hoc functions (UDFs)
  • Example: areaSpec_box, angDist < dist
  • 500,000 queries per day

SELECT objectId, taiMidPoint, fluxToAbMag(psfMag)
FROM Source
JOIN Object USING (objectId)
JOIN Filter USING (filterId)
WHERE areaSpec_box(:raMin, :declMin, :raMax, :declMax)
  AND filterName = 'u'
  AND variability BETWEEN :varMin AND :varMax
ORDER BY objectId, taiMidPoint ASC
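As an illustration only (not the project's actual implementation, which lives in the LSST query layer), the logic of the two UDFs used above can be sketched in Python; the −48.6 zero point assumes fluxes in erg/s/cm²/Hz (the standard AB-magnitude convention):

import math

def area_spec_box(ra, decl, ra_min, decl_min, ra_max, decl_max):
    # True when the position (ra, decl) falls inside the rectangular sky region.
    return ra_min <= ra <= ra_max and decl_min <= decl <= decl_max

def flux_to_ab_mag(flux):
    # Standard AB-magnitude conversion; assumes flux in erg/s/cm^2/Hz.
    return -2.5 * math.log10(flux) - 48.6

print(area_spec_box(359.9595, 2.01, 359.959, 2.0, 359.96, 2.05))  # True
print(round(flux_to_ab_mag(3.631e-20), 2))                        # ~0.0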

SLIDE 5

Project Objectives

  • Propose a distributed architecture able to store 100+ PB of data
  • Open source
  • Shared-nothing
  • Have the possibility to evaluate both simple queries (a few seconds of computation) and complex queries (days of computation)
  • Ability to access objects using indexes or by a full scan on large tables (>> 1 PB)
  • Study the capacity of existing systems to meet the needs of the LSST project
  • Benchmark

SLIDE 6

Outline

  • Context
  • MapReduce
  • SQL on MapReduce
  • Benchmark
  • Data sets
  • Queries
  • Experiments
  • Summary

SLIDE 7

MapReduce (1/2)

  • “MapReduce is a simplified parallel data processing approach for execution on a computer cluster” [Dean and Ghemawat 2004].
  • The programmer specifies only two functions:
  • map: processes a <key, value> pair as input and produces a set of <key2, value> pairs as intermediate results
  • reduce: combines the intermediate values sharing a unique key and produces a set of values as output

map(in_key, in_value) -> list(out_key, intermediate_value)
reduce(out_key, list(intermediate_value)) -> list(out_value)
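A minimal sketch of such a pair of functions (Python, illustrative names), for the running example of the next two slides — counting the Source records that refer to each object:

def map_fn(in_key, source_row):
    # in_key (e.g., a byte offset) is ignored; the value is one Source row
    # of the form (sourceid, objectid, ra, decl, exposureid).
    yield (source_row[1], 1)            # emit one <objectid, 1> pair

def reduce_fn(out_key, intermediate_values):
    # out_key is one distinct objectid; sum its partial counts.
    yield sum(intermediate_values)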

SLIDE 8

MapReduce (2/2)

Running example — count, for each object, the number of Source records that refer to it.

Input (Source table):

SourceID  ObjectID  RA  DECL  ExposureID
1         1         10   3    1
5         2         15   4    1
9         2         20   5    5
13        4         25   7    1
2         1         40   7    2
6         2         45   8    2
10        3         50   9    2
14        4         55  13    3
3         1         60  10    3
7         2         65  11    3
11        3         70  12    3
15        3         75  15    4
4         1         80  12    4
8         2         85  13    4
12        3         90  14    5
16        4         95  20    6

Output:

ObjectID  freq
1         4
2         5
3         4
4         3

SLIDE 9

The same Source records, divided into four input splits (one per node):

Node 01 input split:
SourceID  ObjectID  RA  DECL  ExposureID
1         1         10   3    1
5         2         15   4    1
9         2         20   5    5
13        4         25   7    1

Node 02 input split:
SourceID  ObjectID  RA  DECL  ExposureID
2         1         40   7    2
6         2         45   8    2
10        3         50   9    2
14        4         55  13    3

Node 03 input split:
SourceID  ObjectID  RA  DECL  ExposureID
3         1         60  10    3
7         2         65  11    3
11        3         70  12    3
15        3         75  15    4

Node 04 input split:
SourceID  ObjectID  RA  DECL  ExposureID
4         1         80  12    4
8         2         85  13    4
12        3         90  14    5
16        4         95  20    6

[Figure: on each node a Mapper emits one <ObjectID, 1> pair per record, a Combiner pre-aggregates the pairs locally, and a Partitioner routes each key through the shuffle-and-sort phase to a Reducer; the Reducers sum the partial counts into the final result ObjectID/freq = (1, 4), (2, 5), (3, 4), (4, 3).]
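To make the pictured phases concrete, here is a toy single-process simulation (Python; data layout and the modulo partitioner are illustrative) of the mapper / combiner / partitioner / reducer pipeline on the four splits above:

from collections import defaultdict

splits = [
    [(1, 1), (5, 2), (9, 2), (13, 4)],      # node 1: (sourceid, objectid)
    [(2, 1), (6, 2), (10, 3), (14, 4)],     # node 2
    [(3, 1), (7, 2), (11, 3), (15, 3)],     # node 3
    [(4, 1), (8, 2), (12, 3), (16, 4)],     # node 4
]
num_reducers = 4

reducer_input = defaultdict(list)
for split in splits:
    local = defaultdict(int)                # combiner: pre-aggregate locally
    for _, objectid in split:
        local[objectid] += 1                # mapper emits (objectid, 1)
    for objectid, partial in local.items():
        # partitioner: route each key to one reducer
        reducer_input[objectid % num_reducers].append((objectid, partial))

for r in range(num_reducers):               # reducers sum the partial counts
    totals = defaultdict(int)
    for objectid, partial in reducer_input[r]:
        totals[objectid] += partial
    for objectid, freq in sorted(totals.items()):
        print(objectid, freq)               # final counts: 1->4, 2->5, 3->4, 4->3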

SLIDE 10

SQL-On-MapReduce

[Architecture: a query enters a query-processing layer, which produces MapReduce jobs run by a jobs executor; a data loader places the data in the DFS, which is spread over Workers 1–5.]

SLIDE 11

SQL-On-MapReduce

[Same architecture as the previous slide, annotated with the systems compared below.]

SQL completeness (L, M, H suggest the range of support):

SQL-on-Hadoop technology  Operations SQL  DDL/DML  Packaged analytic functions  UDFs/Custom functions
Hive                      M               M        L                            H
HadoopDB/Hadapt           M               M        L                            L
Drill                     M               L        –                            L
Impala                    M               L        –                            L
Presto                    M               –        –                            L
Spark/Shark               M               M        –                            H
SLIDE 12

Benchmark – Why?

http://spark.apache.org/
http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/

SLIDE 13

Benchmark – What?

  • A set of resources and methods (situations) that allow comparing the efficiency of systems and approaches offering the same functionality
  • Resources: data sets, queries, parameters, machines, …
  • Methods: indexed vs. non-indexed data, smart vs. simple partitioning, selective vs. non-selective queries, increasing the number of machines, compressed data, …

SLIDE 14

Data sets

DataSet   Size
PT 1.1    90 GB
PT 1.2    250 GB
Winter13  3 TB
SDSS      3 TB

SLIDE 15

Data sets

Tables of PT 1.2 (250 GB):

Table                          #attributes  #records  Size
Source                         107          162 (m)   118 GB
Object                         227          4 (m)     7 GB
RefSrcMatch                    10           189 (m)   1.7 GB
Science_Ccd_Exposure_Metadata  6            41 (m)    16 GB

SLIDE 16

Data sets

Scaled datasets obtained from PT 1.2 (Source and Object tables):

DataSet        Source: #records  Source: Size (GB)  Object: #records  Object: Size (GB)
PT 1.2 250 GB  325 (m)           236                9 (m)             14
PT 1.2 500 GB  650 (m)           472                18 (m)            28
PT 1.2 1 TB    1.3 (b)           944                36 (m)            56
PT 1.2 2 TB    2.6 (b)           1888               72 (m)            112

Number of distinct values per attribute:

DataSet  SourceID  ObjectID  DECL     RA       scienceCcdExposureId
D250GB   325 (m)   9 (m)     325 (m)  162 (m)  84 (k)
D500GB   650 (m)   18 (m)    650 (m)  162 (m)  84 (k)
D1TB     1.3 (b)   36 (m)    1.1 (b)  162 (m)  84 (k)
D2TB     2.6 (b)   72 (m)    2.3 (b)  162 (m)  84 (k)

SLIDE 17

Queries (1/2)

Selection:
  Q1: select * from source where sourceid=29785473054213321;
  Q2: select sourceid, ra, decl from source where objectid=402386896042823;
  Q3: select sourceid, objectid from source where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2;
  Q4: select sourceid, ra, decl from source where scienceccdexposureid=454490250461;

Group By:
  Q5: select objectid, count(sourceid) from source where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2 group by objectid;
  Q6: select objectid, count(sourceid) from source group by objectid;

Join:
  Q7: select * from source join object on (source.objectid=object.objectid) where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2;
  Q8: select * from source join object on (source.objectid=object.objectid) where ra > 359.959 and ra < 359.96;
  Q9: SELECT s.psfFlux, s.psfFluxSigma, sce.exposureType FROM Source s JOIN RefSrcMatch rsm ON (s.sourceId = rsm.sourceId) JOIN Science_Ccd_Exposure_Metadata sce ON (s.scienceCcdExposureId = sce.scienceCcdExposureId) WHERE s.ra > 359.959 AND s.ra < 359.96 AND s.decl < 2.05 AND s.decl > 2 AND s.filterId = 2 AND rsm.refObjectId IS NOT NULL;

SLIDE 18

Queries (2/2)

Order by:
  Q10: select objectid, sourceid from source where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2 order by objectid;
  Q11: select objectid, sourceid from source where ra > 359.959 and ra < 359.96 order by objectid;

UDF:
  Q12: select id from rundeepforcedsource where areaspec_box(coord_ra, coord_decl, -55, -2, 55, 2)=1;
  Q13: select fluxToAbMag(flux_naive) from rundeepforcedsource where objectid=1398583353936135;

SLIDE 19

Predicates selectivity

DataSet  SourceID = id  ObjectID = id  2 < DECL < 2.05  359.959 < RA < 359.96  both RA and DECL ranges  scienceCcdExposureId = id
D250GB   1              43             1.6 (m)          14 (k)                 21                       3.6 (k)
D500GB   1              43             3.3 (m)          28 (k)                 43                       7.3 (k)
D1TB     1              43             6.6 (m)          57 (k)                 86                       14.6 (k)
D2TB     1              43             13.2 (m)         127 (k)                172                      29.2 (k)

SLIDE 20

Experiments’ environment

  • Hardware:
  • 10 DELL C6220 machines
  • 2 clusters of 25 and 50 virtual machines (8 GB RAM, 2 cores, and 300 GB of disk space each)
  • Data sets (PT 1.2): 250 GB, 500 GB, 1 TB, and 2 TB
  • Systems: Hive and HadoopDB

Available resources
RAM          1 TB
Disk space   52 TB
#Processors  240

SLIDE 21

Hive and HadoopDB

[Figure: two architectures side by side. Hive runs MapReduce jobs directly over HDFS. HadoopDB runs a modified Hive (HIVE*) over MapReduce, with a local RDBMS on each node alongside HDFS. A query is compiled into a chain of jobs (Job 1 → Job 2 → … → Job 6) whose final step stores the result of the whole job.]
SLIDE 22

Parameters to be measured

  • Tuning: data loading time (indexing, partitioning, ...)
  • Performance: total query execution time
  • Fault tolerance: number of faults handled by the tool
  • Latency: time needed to get the first response (see the measurement sketch below)
  • Situations:
  • Scalability (data volume): 250 GB → 500 GB → 1 TB → 2 TB
  • Hardware evolution: 25 machines → 50 machines
  • Indexed vs. non-indexed data
  • Different partitioning schemes
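A minimal sketch of how total execution time and first-response latency can be measured for one query; sqlite3 stands in for the real engines (Hive, HadoopDB), and the two-column schema is a reduced, illustrative version of the benchmark's Source table:

import time
import sqlite3

def run_query(conn, sql):
    t0 = time.perf_counter()
    cursor = conn.execute(sql)
    first = cursor.fetchone()               # latency: time to first row
    t_first = time.perf_counter() - t0
    cursor.fetchall()                       # drain the remaining rows
    t_total = time.perf_counter() - t0      # performance: total time
    return t_first, t_total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (sourceid INTEGER, objectid INTEGER)")
conn.executemany("INSERT INTO source VALUES (?, ?)",
                 [(i, i % 4) for i in range(10000)])
print(run_query(conn, "SELECT objectid, COUNT(sourceid) "
                      "FROM source GROUP BY objectid"))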

SLIDE 23

Lessons learned (1/8)

  • Hive vs. HadoopDB
  • HadoopDB offers custom partitioning (see the sketch after the chart below)
  • More data => more time to change the data partitioning scheme (scalability)
  • More hardware (machines) => less time to partition the data (hardware speed-up)

[Chart: tuning time in minutes — Hive (HDFS), global hash, and local hash — for 250 GB, 500 GB, and 1 TB datasets on 25 and 50 machines.]
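A minimal sketch of the hash-partitioning step being timed above, assuming hash(key) mod #nodes placement (the deck does not specify the exact hash function): partitioning on objectid sends every Source record of one object to the same machine.

def global_hash_partition(records, key_index, num_nodes):
    # Assign each record to a node by hashing the partitioning attribute.
    nodes = [[] for _ in range(num_nodes)]
    for rec in records:
        nodes[hash(rec[key_index]) % num_nodes].append(rec)
    return nodes

# Example: partition (sourceid, objectid) pairs on objectid over 4 nodes.
records = [(i, i % 4) for i in range(1, 17)]
for n, chunk in enumerate(global_hash_partition(records, 1, 4)):
    print("node", n, chunk)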

SLIDE 24

Lessons learned (2/8)

  • 1st strategy (Hive): global and distributed index (a conceptual sketch follows the charts below)
  • Hive: index size (GB) vs. index creation time (seconds)
  • More distinct values for an attribute => more space to store the index (e.g., decl)
  • We can take advantage of the hardware to accelerate index creation (50 vs. 25 machines)
  • Creation time: 8 min – 2 h on 25 machines and 5 min – 1 h 30 min on 50 machines

[Charts: index creation time (seconds) and index size (GB) for the sourceid, objectid, ra, decl, and scienceccdexposure indexes, on 250 GB, 500 GB, 1 TB, and 2 TB datasets with 25 and 50 machines.]
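As a rough intuition only (not Hive's actual index format), a global index of this kind maps each distinct attribute value to the data blocks holding matching rows, so a selective query reads only those blocks; this also mirrors why high-cardinality attributes such as decl yield large indexes. A minimal sketch with illustrative names:

from collections import defaultdict

def build_index(blocks, key_index):
    # For each distinct value, record the set of blocks containing it.
    index = defaultdict(set)
    for block_id, rows in enumerate(blocks):
        for row in rows:
            index[row[key_index]].add(block_id)
    return index

blocks = [[(1, 1), (5, 2)], [(9, 2), (13, 4)], [(2, 1), (6, 2)]]
idx = build_index(blocks, 1)        # index on objectid
print(sorted(idx[2]))               # blocks to scan for objectid = 2 -> [0, 1, 2]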

SLIDE 25

Lessons learned (3/8)

  • 2nd strategy (HadoopDB): each node manages the indexes of the data it hosts
  • Index creation time vs. total data loading time
  • Index creation can be parallelized, which allows us to take advantage of the available resources
  • We can take advantage of the available hardware to accelerate creation time (50 vs. 25 machines)
  • Creation time: ~10 minutes for 5 indexes

[Chart: data loading vs. indexing time (minutes) for 250 GB, 500 GB, and 1 TB datasets on 25 and 50 machines.]

SLIDE 26

Lessons learned (4/8)

  • Performance of selection tasks (seconds)
  • 50 machines (400 GB of RAM)
  • Non-indexed data (max 30 min) vs. indexed data (max 3 min)
  • Without index: HadoopDB (+) vs. Hive (-), except for the 250 GB dataset
  • With index: Hive (+) vs. HadoopDB (-), except for the Q3 queries on the 1 TB dataset, where Hive failed to use the index because the size of the filtered part is greater than the available RAM (> 8 GB)
  • Scaling:
  • Without index: 250 GB (max 4 min) vs. 500 GB (max 10 min) vs. 1 TB (max 30 min)
  • With index: 250 GB (max 2 min) vs. 500 GB (max 3 min) vs. 1 TB (max 3.5 min)

[Charts: execution times of Q1–Q4 on Hive and HadoopDB for 250 GB, 500 GB, and 1 TB datasets, without an index (seconds) and with an index.]

SLIDE 27

Lessons learned (5/8)

  • Performance of group-by tasks
  • Non-indexed data (max 1 h 8 min) vs. indexed data (max 3 min)
  • Without index: HadoopDB (+) vs. Hive (-) for selective queries (e.g., Q5), and HadoopDB (-) vs. Hive (+) for non-selective queries (e.g., Q6)
  • With index: Hive systematically refused to use the index for non-selective queries; HadoopDB (+) vs. Hive (-) for selective queries, and HadoopDB does not use the index for non-selective queries
  • Scalability:
  • Without index: 250 GB (max 20 min) vs. 500 GB (max 40 min) vs. 1 TB (max 1 h 10 min)
  • With index: 250 GB (max 15 min) vs. 500 GB (max 35 min) vs. 1 TB (max 1 h 8 min)
  • Hive parallelizes group-by processing => Hive without an index is significantly better than HadoopDB with an index

[Chart: group-by execution times (Q5, Q6) on Hive and HadoopDB, with and without indexes, for 250 GB, 500 GB, and 1 TB datasets.]

SLIDE 28
Lessons learned (6/8)

  • Optimization with HadoopDB: change the partitioning attribute
  • Minimizes network cost (number of records transferred from one machine to another); see the sketch below
  • A gain of up to 500% in execution time

[Chart: HadoopDB with custom partitioning — execution times of Q5 and Q6, with and without indexes, when partitioning on SourceID vs. ObjectID, for 250 GB, 500 GB, and 1 TB datasets.]
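A back-of-the-envelope sketch of the network-cost argument (hash placement and the node count are assumptions, not the deck's measured figures): a GROUP BY on objectid ships almost every intermediate record when the table is partitioned on sourceid, and none when it is partitioned on objectid.

def records_shipped(records, partition_key_index, group_key_index, num_nodes):
    shipped = 0
    for rec in records:
        home = hash(rec[partition_key_index]) % num_nodes   # where it is stored
        dest = hash(rec[group_key_index]) % num_nodes       # where it is reduced
        shipped += home != dest         # transfer only when the nodes differ
    return shipped

records = [(i, i % 4) for i in range(100000)]   # (sourceid, objectid)
print(records_shipped(records, 0, 1, 50))       # partitioned on sourceid: most records move
print(records_shipped(records, 1, 1, 50))       # partitioned on objectid: 0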

SLIDE 29
Lessons learned (7/8)

  • Performance of join tasks (seconds)
  • HadoopDB does not support joins over more than 2 tables
  • Hive does not use indexes to evaluate queries with joins
  • Indexed data (>> 2 h) vs. non-indexed data (>>> 2 h)
  • Hive is significantly better than HadoopDB, even when HadoopDB uses indexes
  • Hive parallelizes the processing of joins (see the sketch below)
  • HadoopDB collects the data and performs the join on a single machine
  • HadoopDB does not use the index for non-selective queries

[Chart: join execution times (Q7, Q8, Q9) in seconds for Hive, HadoopDB, and HadoopDB with index, on 250 GB, 500 GB, and 1 TB datasets.]
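A sketch of the reduce-side join pattern that lets a join be spread across reducers (row contents and names are illustrative; Hive's actual physical plan may differ): rows are tagged with their table of origin, grouped on the join key, and joined locally per key.

from collections import defaultdict

source = [(1, 1, 359.9595), (2, 1, 359.9601), (3, 2, 359.9590)]  # (sourceid, objectid, ra)
objects = [(1, "star"), (2, "galaxy")]                           # (objectid, type)

by_key = defaultdict(lambda: ([], []))
for row in source:
    by_key[row[1]][0].append(row)       # tag 0: Source side of the join
for row in objects:
    by_key[row[0]][1].append(row)       # tag 1: Object side of the join

for key, (src_rows, obj_rows) in sorted(by_key.items()):
    for s in src_rows:                  # local cross product per join key
        for o in obj_rows:
            print(s + o)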

SLIDE 30
Lessons learned (8/8)

  • Performance of ORDER BY tasks (seconds)
  • 50 machines (400 GB of RAM)
  • Hive does not allow using an index to evaluate ORDER BY queries
  • Non-indexed data (max 30 min) vs. indexed data (max 1 min)
  • Hive vs. HadoopDB:
  • HadoopDB is significantly better than Hive, with and without indexes
  • ORDER BY is not a parallelizable task within the MapReduce model (see the sketch below)

[Chart: ORDER BY execution times (Q10, Q11) for Hive and HadoopDB, with and without indexes, on 250 GB, 500 GB, and 1 TB datasets.]
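A sketch of why a total order is hard to parallelize: each reducer emits a sorted run, so a parallel ORDER BY needs range partitioning on the sort key for the concatenation of runs to be globally sorted; otherwise everything funnels through one reducer. The boundaries below are hard-coded for illustration.

boundaries = [4, 8, 12]                 # split the key space into 4 disjoint ranges

def range_partition(key):
    for r, bound in enumerate(boundaries):
        if key < bound:
            return r
    return len(boundaries)

records = [9, 1, 14, 3, 7, 12, 5, 2]
parts = [[] for _ in range(len(boundaries) + 1)]
for k in records:
    parts[range_partition(k)].append(k)

ordered = [k for part in parts for k in sorted(part)]  # concat of sorted runs
print(ordered)                                         # globally sorted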

SLIDE 31

Summary

  • Cost(q) = Σ_{k=1..n−1} cost(job_k) + cost(job_n) − cost_Wr(job_n), where cost_Wr(job_n) is the cost of writing the final job's results (a worked instance follows the list)
  • Cost(job) = cost(Map) + cost(Network) + cost(Reduce) + cost(write results)
  • Optimizations:
  • n: number of jobs
  • Cost(Map): indexing, compression, reading data using many "Mappers"
  • Cost(Network): minimizing the number of <key, value> pairs, intermediate results, ...
  • Partitioning
  • Cost(Reduce): parallel processing (#Reducers), minimizing the size of intermediate results
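A worked instance of the cost model with made-up per-job numbers, only to show how the terms combine for a query compiled into n = 3 MapReduce jobs:

def job_cost(map_c, network_c, reduce_c, write_c):
    # Cost(job) = cost(Map) + cost(Network) + cost(Reduce) + cost(write results)
    return map_c + network_c + reduce_c + write_c

jobs = [job_cost(10, 5, 8, 4), job_cost(6, 3, 5, 2), job_cost(7, 2, 4, 3)]
final_write = 3                       # cost_Wr(job_n): write term of the last job

cost_q = sum(jobs[:-1]) + jobs[-1] - final_write
print(cost_q)                         # 27 + 16 + (16 - 3) = 56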

  • http://com.isima.fr/Petasky/hive-vs-hadoopdb
  • http://petasky.univ-lyon1.fr/scale/
