SLIDE 1

1 320302 Databases & WebServices (P. Baumann)

Array Databases

http://www.faculty.jacobs-university.de/pbaumann (see: publications)

http://en.wikipedia.org/wiki/Array_DBMS

[animation: gamingfeeds.com]

SLIDE 2

Who Needs Arrays?

  • Sensor, image, simulation, statistics data
  • Earth: Geodesy, geology, hydrology, oceanography, climate, earth system, ...
  • Space: optical / radio astronomy, cosmological simulation, planetary science, ...
  • Life: Pharma/chem, healthcare / bio research, bio statistics, genetics, ...
  • Engineering & research: Simulation & experimental data in automotive/shipbuilding/aerospace industry, turbines, process industry, ...
  • Management/Controlling: Decision Support, OLAP, Data Warehousing, census, statistics in industry and public administration, ...
  • Multimedia: distance learning, prepress, ...
  • "80% of all data have some spatial connotation" [C&P Hane, 1992]

SLIDE 3

Arrays in [Geo] Science & Engineering

  • spatio-temporal sensor, image, simulation, statistics data(cubes)

[Figure: sensor feeds and a Big Data server; source: OGC SWE]

SLIDE 4

CONCEPTUAL MODELLING

SLIDE 5

The Array Data Model

[Figure: a 2-D array annotated with the model's terms: spatial domain, dimension, lower bound, upper bound, cell, cell value]
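The terms of the array data model can be made concrete with a small sketch. The class `MDArray` and its methods are hypothetical names chosen for illustration (they are not rasdaman API), and the bounds in the usage lines are illustrative values only. A dictionary keyed by coordinate is used for clarity, not efficiency.

```python
from itertools import product


class MDArray:
    """Hypothetical n-D array: a spatial domain plus one value per cell."""

    def __init__(self, domain, fill=0):
        # domain: one inclusive (lower bound, upper bound) pair per dimension
        if any(lo > hi for lo, hi in domain):
            raise ValueError("lower bound must not exceed upper bound")
        self.domain = tuple(domain)
        self.cells = {coord: fill for coord in self.coordinates()}

    @property
    def dimension(self):
        # Number of axes of the spatial domain
        return len(self.domain)

    def coordinates(self):
        # Enumerate every cell coordinate inside the spatial domain
        return product(*(range(lo, hi + 1) for lo, hi in self.domain))

    def __getitem__(self, coord):
        return self.cells[coord]  # raises KeyError outside the domain

    def __setitem__(self, coord, value):
        if coord not in self.cells:
            raise IndexError(f"{coord} lies outside {self.domain}")
        self.cells[coord] = value


# Illustrative 2-D array: first axis 21..24, second axis 4..8
a = MDArray([(21, 24), (4, 8)])
a[22, 6] = 42  # set the cell value at coordinate (22, 6)
```

Note that the lower bound need not be 0: the spatial domain is part of the array's definition, which is what distinguishes this model from a plain programming-language array.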

SLIDE 6

Array Analytics

  • Array Analytics := efficient analysis on multi-dimensional arrays of a size several orders of magnitude above the evaluation engine's main memory
  • Essential data property: n-dimensional Euclidean neighborhood
  • Secondary: #dimensions, density, ...
  • Operations: signal/image processing, Linear Algebra [M. Stonebraker], iterations

SLIDE 7

Let’s Take a Closer Look...

  • Divergent access patterns for ingest and retrieval
  • Server must mediate between access patterns


SLIDE 8

rasdaman

  • "raster data manager": SQL + n-D arrays
  • Scalable parallel "tile streaming" architecture
  • [VLDB 1994, VLDB 1997, SIGMOD 1998, VLDB 2003, …, VLDB 2016]
  • Blueprint for standards, in operational use
  • 250 TB PB

SLIDE 9

Array Embedding

  • Goal: integration of arrays with relational model
  • tables of typed n-D arrays
  • Original rasql: Array + system attribute OID
  • "collections" = binary relations (oid, array)
  • In hindsight, bad tuple access design: array like tuple variable, oid via function
  • In future: ISO SQL/MDA (Multi-Dimensional Arrays)
  • Arrays as another "attribute type"
  • Under finalization in ISO

[Figure: collection MyColl pairing each OID with an array (id 1 to id 5); table MyData with attributes att 1, att 2, ..., att n and rows key1, key2, key3 holding arrays id 1, id 2, id 3]

select m[ 100:199, 100:199 ]
from MyColl as m
where oid(m) = 42
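A minimal sketch of the original rasql collection design, assuming a collection can be modelled as a plain list of (oid, array) pairs. The helper `subset` and the sample data are hypothetical and serve only to illustrate the semantics of the query above: filter by OID, then trim the array to an inclusive index range.

```python
def subset(array, rows, cols):
    """Trim a 2-D array (list of rows) to inclusive bounds per axis."""
    (r0, r1), (c0, c1) = rows, cols
    return [row[c0:c1 + 1] for row in array[r0:r1 + 1]]


# A "collection" modelled as a binary relation of (oid, array) pairs.
# Cell values encode their own position so the trimming is easy to check.
my_coll = [
    (41, [[0] * 200 for _ in range(200)]),
    (42, [[r * 1000 + c for c in range(200)] for r in range(200)]),
]

# Mirrors the query above: keep the array whose oid is 42,
# then cut out the region [100:199, 100:199].
result = [subset(m, (100, 199), (100, 199)) for oid, m in my_coll if oid == 42]
```

The query result is again a set of arrays (here a list holding one 100 x 100 array), which matches the idea that the array acts like a tuple variable while the OID is reached through a function.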

SLIDE 10

The rasql Query Language

  • selection & subsetting

select c[ *:*, 100:200, *:*, 42 ]
from ClimateSimulations as c

  • result processing

select img * (img.green > 130)
from LandsatArchive as img

  • search & aggregation

select mri
from MRI as mri, masks as m
where some_cells( mri > 250 and m )

  • data format conversion

select encode( c[*:*,*:*,100,42], "png" )
from ClimateSimulations as c

[Figure: rasdaman DB with data formats HDF, PNG, NetCDF on both the input and the output side]
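The "result processing" and "search & aggregation" queries rely on cell-wise (induced) operations and on a condenser that reduces a boolean array to one truth value. The sketch below illustrates those semantics on flattened 1-D lists; the function names `induce` and `some_cells` and the sample values are assumptions for illustration, not the rasdaman implementation.

```python
def induce(op, a, b):
    """Apply a binary operation cell by cell; b may be an array or a scalar."""
    if isinstance(b, list):
        return [op(x, y) for x, y in zip(a, b)]
    return [op(x, b) for x in a]


def some_cells(mask):
    """True if at least one cell of a boolean array is true."""
    return any(mask)


# Flattened sample arrays (hypothetical values)
green = [120, 131, 200, 90]
img = [10, 20, 30, 40]

# img.green > 130 gives a 0/1 mask with the same extent as the input
bright = induce(lambda x, t: int(x > t), green, 130)

# img * mask keeps cells where the mask is 1 and zeroes out the rest
masked = induce(lambda x, y: x * y, img, bright)

# Predicate as used in a where clause
has_hot_spot = some_cells(induce(lambda x, t: x > t, green, 250))
```

With these values the mask selects the second and third cell, so the masked array keeps 20 and 30 and sets the other cells to 0, while the threshold 250 is never exceeded.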

SLIDE 11

Visual Database Interaction

select encode(
  struct {
    red:   (char) s.image.b7[x0:x1,x0:x1],
    green: (char) s.image.b5[x0:x1,x0:x1],
    blue:  (char) s.image.b0[x0:x1,x0:x1],
    alpha: (char) scale( d.elev, 20 )
  },
  "image/png" )
from SatImage as s, DEM as d

[JacobsU, Fraunhofer; data courtesy BGS, ESA]

SLIDE 12

Linear Algebra Ops

  • Matrix multiplication

select marray i in [0:m], j in [0:p]
       values condense +
              over k in [0:n]
              using a [ i, k ] * b [ k, j ]
from matrix as a, matrix as b

  • Histogram

select marray bucket in [0:255]
       values count_cells( img = bucket )
from img
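The two queries combine an array constructor (marray) with an aggregation (condense). A minimal Python sketch of those semantics follows; the functions `marray` and `condense` are hypothetical stand-ins that only mimic the rasql operators of the same name, and arrays are represented as dictionaries from coordinate tuples to values.

```python
from itertools import product
from operator import add


def marray(domain, cell_fn):
    """Build an array by evaluating cell_fn at every coordinate of the domain.

    domain holds one inclusive (lo, hi) pair per dimension.
    """
    ranges = [range(lo, hi + 1) for lo, hi in domain]
    return {coord: cell_fn(*coord) for coord in product(*ranges)}


def condense(op, start, domain, cell_fn):
    """Fold op over the values of cell_fn across the domain."""
    acc = start
    for coord in product(*(range(lo, hi + 1) for lo, hi in domain)):
        acc = op(acc, cell_fn(*coord))
    return acc


# Matrix multiplication: one condense over k per result cell (i, j)
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
matmul = marray(
    [(0, 1), (0, 1)],
    lambda i, j: condense(add, 0, [(0, 1)], lambda k: a[i][k] * b[k][j]),
)

# Histogram: one bucket per possible cell value, counting matching cells
img = [0, 3, 3, 1]
hist = marray([(0, 3)], lambda bucket: sum(1 for v in img if v == bucket))
```

Because every result cell is defined independently of the others, an engine is free to evaluate cells (or whole tiles of cells) in any order or in parallel.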

SLIDE 13

Arrays in SQL

[SSDBM 2014]

create table LandsatScenes(
  id: integer not null,
  acquired: date,
  scene: row( band1: integer, ..., band7: integer ) mdarray [ 0:4999,0:4999] )

select id,
       encode( (scene.band1-scene.band2)/(scene.band1+scene.band2), "image/tiff" )
from LandsatScenes
where acquired between "1990-06-01" and "1990-06-30"
  and avg( (scene.band3-scene.band4)/(scene.band3+scene.band4) ) > 0

SLIDE 14

ARCHITECTURE

SLIDE 15

Storage Mapping: Variants

  • Coordinate-free sequence
      • BLOB (binary large object)
      • Costs mainly position/dimension dependent
  • Sequence independent, coordinates explicit: { (x1,f1), (x2,f2), ..., (xn,fn) }
      • ROLAP
      • Costs not position correlated, but high
  • Partitioning, sequence within partition
      • Imaging, multidimensional OLAP
      • Costs low for bulk access, usually not location correlated

[Figure: a 2-D array with a rectangular region of interest (X) among other cells (o); once the rows are stored as one sequence, the region breaks up into separate runs of X]
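Why the cost of the coordinate-free (BLOB) variant depends on position and dimension can be shown with a short sketch. It assumes row-major linearization, which is one common choice and not necessarily the one used by any particular system; `linear_offset` and `count_runs` are hypothetical helpers.

```python
from itertools import product


def linear_offset(coord, domain):
    """Row-major offset of a cell in an array stored as one sequence."""
    offset = 0
    for x, (lo, hi) in zip(coord, domain):
        if not lo <= x <= hi:
            raise IndexError(f"{coord} lies outside {domain}")
        offset = offset * (hi - lo + 1) + (x - lo)
    return offset


def count_runs(box, domain):
    """Number of contiguous byte runs needed to read a box from the sequence."""
    cells = product(*(range(lo, hi + 1) for lo, hi in box))
    offsets = sorted(linear_offset(c, domain) for c in cells)
    breaks = sum(1 for prev, cur in zip(offsets, offsets[1:]) if cur != prev + 1)
    return 1 + breaks


domain = [(0, 3), (0, 29)]          # 4 rows of 30 cells each
region = [(0, 2), (22, 29)]         # 3 rows x 8 columns at the right edge
full_rows = [(0, 2), (0, 29)]       # the same 3 rows, full width

runs_region = count_runs(region, domain)
runs_full_rows = count_runs(full_rows, domain)
```

The small region needs one run per row, whereas the much larger full-width box is a single run. Partitioning into tiles, the third variant, is aimed at making this cost less dependent on where and along which axis a query cuts.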

SLIDE 16

Datacube Partitioning

  • Goal: faster tile loading by adapting storage units to access patterns
  • Approach: partition n-D array into n-D partitions ("tiles")
  • Tiling classification based on degree of alignment [ICDE 1999]

[Figure: tiling classes: aligned (regular, irregular), partially aligned, nonaligned, totally nonaligned; label "chunking" [Sarawagi, Stonebraker, DeWitt, ... ]]
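For the simplest class, regular aligned tiling, the set of tiles a query has to load follows directly from integer division. The sketch below assumes the array origin is at 0 in every dimension and uses a hypothetical helper `tiles_for_query`; irregular and nonaligned tilings would need an index lookup instead.

```python
from itertools import product


def tiles_for_query(query, tile_shape):
    """Grid indices of the regular tiles touched by a query box.

    query holds one inclusive (lo, hi) pair per dimension;
    tile_shape holds the tile edge length per dimension.
    """
    ranges = [
        range(lo // edge, hi // edge + 1)
        for (lo, hi), edge in zip(query, tile_shape)
    ]
    return list(product(*ranges))


query = [(100, 199), (100, 199)]

square_tiles = tiles_for_query(query, (64, 64))    # square 64 x 64 tiles
strip_tiles = tiles_for_query(query, (256, 16))    # tall, narrow tiles
```

The same query touches 9 tiles under the square layout and 7 under the strip layout, and the tiles differ in how much unneeded data they carry. This is the sense in which the tile shape can be adapted to the expected access pattern.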

SLIDE 17

Why Irregular Tiling?

  • e-Science often uses irregular partitioning

[OpenStreetMap] [Centrella et al: scidacreviews.org]

SLIDE 18

Tiling as a Tuning Parameter

  • tiling strategies [ICDE 1999]: regular, directional, ..., area of interest
  • storage layout language [SSTDM 2010]

insert into MyCollection
values ...
tiling area of interest [0:20,0:40], [45:80,80:85]
tile size 1000000
index d_index
storage array compression zlib

SLIDE 19

Query Processing

  • Clear separation: set vs array trees
  • Arrays as 2nd order attributes
  • Extensive optimization
  • Tile-based evaluation

select a < sum_cells( b + c )
from a, b, c

[Figure: operator tree for the query: <ind( a, sum( +ind( b, c ) ) )]

SLIDE 20

Array Joins

  • "A θ B" in presence of partitioned arrays A, B
  • Challenge: partitions shifted, different size, heterogeneous
  • inefficient multiple reads of sub-arrays
  • Goal: optimal partition loading sequence
  • Approach: bi-partite graph traversal
  • Also useful for buffer mgmt, parallelization
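The join problem above can be sketched as a bipartite graph whose nodes are the tiles of A and B and whose edges connect overlapping tiles. The code below only builds that graph and compares two traversal orders under an assumed buffer of one tile per array; the names `overlaps`, `join_graph` and `loads` are hypothetical, and the actual traversal strategy of the approach named on the slide is not reproduced here.

```python
def overlaps(p, q):
    """True if two boxes (one inclusive (lo, hi) pair per dimension) intersect."""
    return all(
        p_lo <= q_hi and q_lo <= p_hi
        for (p_lo, p_hi), (q_lo, q_hi) in zip(p, q)
    )


def join_graph(tiles_a, tiles_b):
    """Bipartite graph: an edge (i, j) for every overlapping tile pair."""
    return [
        (i, j)
        for i, p in enumerate(tiles_a)
        for j, q in enumerate(tiles_b)
        if overlaps(p, q)
    ]


def loads(edge_order):
    """Tile reads needed when the buffer holds one tile of A and one of B."""
    count, cur_a, cur_b = 0, None, None
    for i, j in edge_order:
        if i != cur_a:
            count, cur_a = count + 1, i
        if j != cur_b:
            count, cur_b = count + 1, j
    return count


# 1-D example: B is partitioned differently from A (shifted, different sizes)
tiles_a = [[(0, 9)], [(10, 19)]]
tiles_b = [[(0, 4)], [(5, 14)], [(15, 19)]]

edges = join_graph(tiles_a, tiles_b)
good_order = edges                              # neighbours share a tile
bad_order = [(0, 0), (1, 1), (0, 1), (1, 2)]    # same edges, jumping around
```

Both orders process the same four tile pairs, but the first needs 5 tile reads and the second 7, which illustrates why the loading sequence matters.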

SLIDE 21

Query Optimization

select avg_cells( a + b )
from a, b

is rewritten to

select avg_cells( a ) + avg_cells( b )
from a, b

[Figure: operator trees. Before: avg( +ind( a, b ) ), tile stream, high traffic. After: +( avg( a ), avg( b ) ), scalar stream, low traffic. [Ritsch 2000]]
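The equivalence behind this rewrite can be checked numerically. The sketch assumes both arrays share the same domain, so both averages divide by the same cell count; the sample values are made up.

```python
from math import isclose
from statistics import fmean

# Two arrays over the same domain, flattened (hypothetical values)
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

# avg_cells( a + b ): the cell-wise sum must be materialized first,
# so whole tiles flow into the addition node
before = fmean(x + y for x, y in zip(a, b))

# avg_cells( a ) + avg_cells( b ): each array is reduced to one scalar,
# so only two numbers flow into the addition node
after = fmean(a) + fmean(b)

same_result = isclose(before, after)
```

Both forms yield 27.5 here. The rewritten form moves the reduction below the addition, which is why its intermediate stream carries scalars instead of tiles.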

SLIDE 22

Optimisation Does Pay Off!

  • Complex queries give more space to optimizer
  • Typical OGC Web Map Service query:

select jpeg( scale(bild0[...],[1:300,1:300]) * { 1c, 1c, 1c}
  overlay ((scale(bild1[...],[1:300,1:300])<71.0)) * {51c, 153c, 255c }
  overlay bit(scale(bild2[...],[1:300,1:300]), 2) * {230c, 230c, 204c}
  overlay bit(scale(bild2[...],[1:300,1:300]), 5) * {1c, 1c, 1c}
  overlay bit(scale(bild2[...],[1:300,1:300]), 7) * {102c, 102c, 102c}
  overlay bit(scale(bild2[...],[1:300,1:300]), 6) * {255c, 255c, 0c}
  overlay bit(scale(bild2[...],[1:300,1:300]), 3) * {191c, 242c, 128c}
  overlay bit(scale(bild2[...],[1:300,1:300]), 4) * {191c, 255c, 255c}
  overlay bit(scale(bild2[...],[1:300,1:300]), 1) * {0c, 255c, 255c}
  overlay bit(scale(bild2[...],[1:300,1:300]), 0) * {102c, 102c, 102c} )
from ...

SLIDE 23

select max((A.nir - A.red) / (A.nir + A.red))

  • max((B.nir - B.red) / (B.nir + B.red))
  • max((C.nir - C.red) / (C.nir + C.red))
  • max((D.nir - D.red) / (D.nir + D.red))

from A, B, C, D

Parallel / Distributed Query Processing

1 query distributed over 1,000+ cloud nodes [ACM SIGMOD DANAC 2014]

[Figure: one query spanning Dataset A, Dataset B, Dataset C, Dataset D]

SLIDE 24

Architecture

distributed query processing

No single point of failure

alternative storage [SSTD 2013]

[Figure: components: Web clients (m2m, browser), Internet, rasdaman geo services, rasserver, rasfed (federation daemon), tile access, database, file system, external files]

SLIDE 25

APPLICATIONS

SLIDE 26

EarthServer

  • Agile Analytics on x/y/t + x/y/z/t Earth & Planetary datacubes
  • EU rasdaman + US NASA WorldWind
  • Rigorous use of standards as client/server APIs
  • 2.5+ Petabyte
  • Intercontinental initiative, 3+3 years: EU + US + AUS

www.earthserver.eu

SLIDE 27

OGC WCPS: Analyzing Datacubes

  • Web Coverage Processing Service (WCPS) = spatio-temporal datacube analytics language

for $c in ( M1, M2, M3 )
where some( $c.nir > 127 )
return encode( $c.red - $c.nir, "image/tiff" )

  • "From MODIS scenes M1, M2, M3: difference red & nir, as TIFF"
  • "…but only those where nir exceeds 127 somewhere"
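The for / where / return structure of the WCPS request maps onto a filtered comprehension. The sketch below assumes each scene is a dictionary of flattened bands with made-up values, and it omits the final encoding to TIFF; it illustrates the semantics only and is not a WCPS client.

```python
# Hypothetical stand-ins for the MODIS scenes M1, M2, M3
scenes = {
    "M1": {"red": [100, 90], "nir": [120, 100]},
    "M2": {"red": [50, 60], "nir": [130, 110]},
    "M3": {"red": [200, 10], "nir": [90, 255]},
}

# for $c in ( M1, M2, M3 )
# where some( $c.nir > 127 )
# return $c.red - $c.nir        (encoding step left out)
results = {
    name: [red - nir for red, nir in zip(c["red"], c["nir"])]
    for name, c in scenes.items()
    if any(value > 127 for value in c["nir"])
}
```

Scene M1 is dropped because none of its nir cells exceeds 127; for M2 and M3 the cell-wise difference is returned.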

SLIDE 28

MEA: Land Surface Temperature, Cloudfree

SLIDE 29

ECMWF: River Discharge

SLIDE 30

MEA: Daily Hydro Estimator

SLIDE 31

British Geological Survey

[BGS 2013]

SLIDE 32

Cosmological Simulation

  • Modelling domain: 4D
  • Dark matter (highest mass factor in universe)
  • Baryonic matter (stars, gas, dust, …)
  • Coupled simulation: particle + fluid
  • Results: 3D/4D cutouts from universe
  • E.g., 64 Mpc³ (1 pc ≈ 3.26 light years)
  • Screenshots: AstroMD [Gheller, Rossi 2001]

SLIDE 33

Gene Expression Analysis

  • Gene expression = reading out genes for reproduction
  • Research goal: capture spatio-temporal expression patterns in Drosophila genes

select encode( scale( {1c,0c,0c}*e[0,*:*,*:*]
                    + {0c,1c,0c}*e[1,*:*,*:*]
                    + {0c,0c,1c}*e[2,*:*,*:*], 0.2 ), "image/jpeg" )
from EmbryoImages as e
where oid(e)=193537

http://urchin.spbcas.ru/Mooshka/ [Samsonova et al]

SLIDE 34

Human Brain Imaging

  • Research goal: to understand structural-functional relations in human brain
  • Experiments capture activity patterns (PET, fMRI)
  • Temperature, electrical, oxygen consumption, ...
  • lots of computations yield "activation maps"
  • Example: "a parasagittal view of all scans containing critical Hippocampus activations, TIFF-coded."

select tiff( ht[ $1, *:*, *:* ] )
from HeadTomograms as ht, Hippocampus as mask
where count_cells( ht > $2 and mask ) / count_cells( mask ) > $3

$1 = slicing position, $2 = intensity threshold value, $3 = confidence

SLIDE 35

Domains Investigated

  • Geo
      • Environmental sensor data, 1-D
      • Satellite / seafloor maps, 2-D
      • Geophysics (3-D x/y/z)
      • Climate modelling (4-D, x/y/z/t)
  • Life science
      • Gene expression simulation (3-D)
      • Human brain imaging (3-D / 4-D)
  • Other
      • Computational Fluid Dynamics (3-D)
      • Astrophysics (4-D)
      • Statistics (n-D)

SLIDE 36

WRAP-UP

SLIDE 37

Early History of Array Databases

See also: RDA Array Database Assessment WG

SLIDE 38

Early 3-D Service on rasdaman

[Diedrich et al 2001]

SLIDE 39

Summary

  • Arrays are a core data structure next to sets, graphs, hierarchies
      • sensor, image, simulation, statistics datacubes
  • Array DBMS for declarative queries on massive n-D arrays
      • rasdaman
  • Issues:
      • enhancing distributed processing
      • iterative methods
      • ...