Scientific data management and analysis – Data acquired fr from the Larg rge Synoptic Survey Telescope (L (LSST)
Amin Mesmoudi
1
Scientific data management and analysis Data acquired fr from the - - PowerPoint PPT Presentation
Scientific data management and analysis Data acquired fr from the Larg rge Synoptic Survey Telescope (L (LSST) Amin Mesmoudi 1 Context CNRS-Mastodons 2012- LSST Project 2 LSST needs in storage and data access (1/2) Storage:
1
2
3
Table Size #Records #Attributes Object 109 TB 38 B 470 Moving Object 5 GB 6 M 100 Source 3.6 PB 5 T 125 Forced Source 1.1 PB 32 T 7 Difference Image Source 71 TB 200 B 65 CCD Exposure 0.6 TB 17 B 45
1 trillion=1018
4
SELECT objectId, taiMidPoint, fluxToAbMag(psfMag) FROM Source JOIN IN Object USING(objectId) JOIN Filter USING(filterId) WHERE areaSpec_box(:raMin, :declMin, :raMax, :declMax) AND filterName = 'u' AND variability BETWEEN :varMin AND :varMax ORDER R BY objectId, taiMidPoint ASC
5
6
map (in_key, in_value) -> list(out_key, intermediate_value) reduce (out_key, list(intermediate_value)) -> list(out_value)
7
SourceID ObjectID RA DECL ExposureID 1 1 10
1 5 2 15
1 9 2 20
5 13 4 25
1 2 1 40
2 6 2 45
2 10 3 50
2 14 4 55
3 3 1 60
3 7 2 65
3 11 3 70
3 15 3 75
4 4 1 80
4 8 2 85
4 12 3 90
5 16 4 95
6 ObjectID freq 1 4 2 5 3 4 4 3
8
SourceID ObjectID RA DECL ExposureID 1 1 10
1 5 2 15
1 9 2 20
5 13 4 25
1 SourceID ObjectID RA DECL ExposureID 2 1 40
2 6 2 45
2 10 3 50
2 14 4 55
3 SourceID ObjectID RA DECL ExposureID 3 1 60
3 7 2 65
3 11 3 70
3 15 3 75
4 SourceID ObjectID RA DECL ExposureID 4 1 80
4 8 2 85
4 12 3 90
5 16 4 95
6
Node 01 Node 02 Node 03 Node 04 Mapper Shuffle and Sort Combiner
1 1 2 1 2 1 4 1
Mapper Mapper Mapper Combiner Combiner Combiner
Partitioner Partitioner Partitioner Partitioner
Reducer Reducer Reducer
1 1 2 1 3 1 4 1 1 1 2 1 3 1 3 1 1 1 2 1 3 1 4 1 1 1 2 2 4 1 1 1 2 1 3 1 4 1 1 1 2 1 3 2 1 1 2 1 3 1 4 1 1 1 1 1 1 1 1 1 2 2 2 1 2 1 2 1 3 1 3 2 3 1 4 1 4 1 4 1 1 4 2 5 3 4 4 3 ObjectID freq 1 4 2 5 3 4 4 3
9
ObjectID freq ObjectID freq ObjectID freq ObjectID freq
DFS Worker 1 Worker 2 Worker 3 Worker 4 Worker 5 MapReduce Data loader Jobs executor Query processing Query
10
DFS Worker 1 Worker 2 Worker 3 Worker 4 Worker 5 MapReduce Data loader Jobs executor Query processing Query
11
SQL Completeness (L, M, H suggests range of support) SQL-on-Hadoop Technology Operations SQL DDL/DML Packaged Analytic functions UDFs/Custom functions Hive M M L H HadoopDB/Hadapt M M L L Drill M L
Impala M L
Presto M
M M
http://spark.apache.org/ http://blog.cloudera.com/blog/2014/01/impala-performance- dbms-class-speed/
12
13
DataSet Size PT 1.1 90 GB PT 1.2 250 GB Winter13 3 TB SDSS 3 TB PT 1.2 250 GB
14
DataSet Size PT 1.1 90 GB PT 1.2 250 GB Winter13 3 TB SDSS 3 TB Table #attributes #records #size Source 107 162 (m) 118 GB Object 227 4 (m) 7 GB RefSrcMatch 10 189 (m) 1.7 GB Science_Ccd_Exp
6 41 (m) 16 GB PT 1.2 250 GB Source 107 162 (m) 118 GB Object 227 4 (m) 7 GB
15
DataSet Source: #records Source: Size (GB) Object: #records Object: Size (GB)
PT 1.2 250 GB 325 (m) 236 9 (m) 14 PT 1.2 500 GB 650 (m) 472 18 (m) 28 PT 1.2 1 TB 1.3 (b) 944 36 (m) 56 PT 1.2 2 TB 2.6 (b) 1888 72 (m) 112
DataSet Size PT 1.1 90 GB PT 1.2 250 GB Winter13 3 TB SDSS 3 TB Table #attributes #records #size Source 107 162 (m) 118 GB Object 227 4 (m) 7 GB RefSrcMatch 10 189 (m) 1.7 GB Science_Ccd_Exp
6 41 (m) 16 GB
DataSet SourceID ObjectID DECL RA scienceCcd ExposureId D250GB 325 (m) 9 (m) 325 (m) 162 (m) 84 (k) D500GB 650 (m) 18 (m) 650 (m) 162 (m) 84 (k) D1TB 1.3 (b) 36 (m) 1,1 (b) 162 (m) 84 (k) D2TB 2.6 (b) 72 (m) 2,3 (b) 162 (m) 84 (k)
PT 1.2 250 GB
+
Source 107 162 (m) 118 GB Object 227 4 (m) 7 GB
16
id SQL syntaxe Q1 select * from source where sourceid=29785473054213321; Q2 select sourceid, ra,decl from source where objectid=402386896042823; Q3 select sourceid, objectid from source where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2; Q4 select sourceid, ra,decl from source where scienceccdexposureid=454490250461; Q5 select objectid,count(sourceid) from source where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2 group by objectid; Q6 select objectid,count(sourceid) from source group by objectid; Q7 select * from source join object on (source.objectid=object.objectid) where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2; Q8 select * from source join object on (source.objectid=object.objectid) where ra > 359.959 and ra < 359.96; Q9 SELECT s.psfFlux, s.psfFluxSigma, sce.exposureType FROM Source s JOIN RefSrcMatch rsm ON (s.sourceId = rsm.sourceId) JOIN Science_Ccd_Exposure_Metadata sce ON (s.scienceCcdExposureId = sce.scienceCcdExposureId) WHERE s.ra > 359.959 and s.ra < 359.96 and s.decl < 2.05 and s.decl > 2 and s.filterId = 2 and rsm.refObjectId is not NULL; Selection Group By Join
17
id SQL syntaxe Q10 select objectid,sourceid from source where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2
Q11 select objectid,sourceid from source where ra > 359.959 and ra < 359.96 order by objectid; Q12 select id FROM rundeepforcedsource where areaspec_box(coord_ra,coord_decl,-55,-2,55,2)=1; Q13 select fluxToAbMag(flux_naive) from rundeepforcedsource where objectid=1398583353936135; Order by UDF
18
DataSet SourceID =id ObjectID =id 2<DECL<2.05 359.959 <ra < 359.96 359.959 <ra < 359.96 and 2 <decl < 2.05 scienceCcdExposureId D250GB 1 43 1,6 (m) 14 (k) 21 3.6 (k) D500GB 1 43 3,3 (m) 28 (k) 43 7.3 (k) D1TB 1 43 6,6 (m) 57 (k) 86 14.6 (k) D2TB 1 43 13,2 (m) 127 (k) 172 29,2 (k)
19
20
Available resources RAM 1 TB Disk space 52 TB #Processors 240
21
Map/reduce Map/reduce Map/reduce
HDFS HIVE Map/reduce Map/reduce Map/reduce HDFS
HIVE* (modified)
RDBMS RDBMS
HadoopDB Hive
Job 1 Job 2 Job 3 Job 4 Job 5 Job 6 Store result
22
23
200 400 600 800 1000 1200 1400 1600 1800 2000 250 Go 500 Go 1 To 250 Go 500 Go 1 To 25 machine 50 machines
minutes
Hive (HDFS) global hash local hash tuning
machines)
24
1000 2000 3000 4000 5000 6000 7000 8000 250GB 500 GB 1TB 2 TB 250GB 500 GB 1TB 2 TB 25 machines 50 machines
Time (seconds)
Index creation
Sourceid
ra decl scienceccdexposure 50 100 150 200 250GB 500 GB 1TB 2 TB
Size (GB)
Index size
Sourceid
ra decl scienceccdexposure
machines)
25
500 1000 1500 2000 250 Go 500 Go 1 To 250 Go 500 Go 1 To 25 machine 50 machines
Index creation (minutes)
Load time indexation
500 1000 1500 2000 HadoopDB Hive HadoopDB Hive HadoopDB Hive 250 GB 500 GB 1 TB
Time (seconds)
Q1 Q2 Q3 Q4 50 100 150 200 Hive HadoopDB Hive HadoopDB Hive HadoopDB 250 Gb 500 GB 1 TB
With Index
Q1 Q2 Q3 (RA) Q4
the index because the size of filtered part is greater than the available RAM (> 8GB)
26
500 1000 1500 2000 2500 3000 3500 4000 4500 Hive Hive wth Index HadoopDB HadoopDB wth index Hive Hive wth index HadoopDb HadoopDb wth index Hive Hive wth index HadoopDB HadoopDB wth index 250 go 500 go 1 To
Group by
Q5 Q6
Q6)
and HadoopDB does not use the index for non-selective queries
27
28
500 1000 1500 2000 2500 3000 3500 4000 4500 HadoopDB HadoopDB wth index HadoopDB HadoopDB wth index HadoopDb HadoopDb wth index HadoopDB HadoopDB wth index HadoopDB HadoopDB wth index HadoopDB HadoopDB wth index SourceID ObjectID SourceID ObjectID SourceID ObjectID 250 go 500 go 1 To
HadoopDB with a custom partitioning
Q5 Q6
29
1000 2000 3000 4000 5000 6000 Hive HadoopDB HadoopDB wth index Hive HadoopDB HadoopDB wth index Hive HadoopDB HadoopDB wth index 250 GB 500 GB 1 TB
Time (seconds)
Join tasks
Q7 Q8 Q9
30
500 1000 1500 2000 Hive Hive wth index HadoopDB HadoopDB index Hive Hive wth index HadoopDB HadoopDB index Hive Hive wth index HadoopDB HadoopDB index 250 Go 500 Go 1 To
Order by tasks
Q10 Q11
𝑜−1 𝐷𝑝𝑡𝑢 𝑘𝑝𝑐𝑙 + 𝑑𝑝𝑡𝑢(𝑘𝑝𝑐𝑜) − 𝑑𝑝𝑡𝑢𝑋𝑠(𝑘𝑝𝑐𝑜)
results
31