Capturing the Laws of (Data) Nature, by Hannes Mühleisen, Martin Kersten & Stefan Manegold




slide-1
SLIDE 1

Capturing the Laws of (Data) Nature

Hannes Mühleisen, Martin Kersten & Stefan Manegold CIDR 2015

slide-2
SLIDE 2

Statistical Model Fitting & DB?

slide-3
SLIDE 3

Database: I am storing some data.
Stats: User gave me a model, let’s see. I need some of the observations to fit the model.
Database: This other guy is reading some of my data.
Stats: Cool, the model seems to fit the data well! Let’s get some more data to validate the fit…
Database: This other guy is reading some more of my data.
Stats: Amazing, model fit is validated. Beer!
Database: I am storing some data.

slide-4
SLIDE 4

The point?

  • Everyone has models; they encode our understanding of the world
  • Everyone has data to train/fit and validate a model
  • So far, the data management community has ignored these models
  • But they hold precious domain knowledge!
slide-5
SLIDE 5

LOFAR Example

slide-6
SLIDE 6
slide-7
SLIDE 7
slide-8
SLIDE 8

[Figure labels: Configuration, Measurement]

slide-9
SLIDE 9

[Figure labels: Model!, Grouped by-source operation, Convergence Hints]

slide-10
SLIDE 10

[Figure labels: Measurement, Configuration, Fitted parameters]

slide-11
SLIDE 11
slide-12
SLIDE 12
[Plot: Intensity (Jy) against Frequency (GHz), x-axis from 0.10 to 0.20, y-axis from 2.0 to 3.5; source=17562, alpha=-0.692, p=0.812]

slide-13
SLIDE 13

Exploit!

slide-14
SLIDE 14

  • Model to function conversion (automatic)
  • Move to DB (automatic)
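
The two automatic steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the power-law form I ≈ p · ν^α from the workflow slide near the end of the deck, and SQLite (through Python's sqlite3 module) stands in for the actual database system. The table name model_params, the function name predict_intensity, and the query frequency 0.14 GHz are hypothetical; the parameter values for source 17562 are taken from the plot slide.

```python
import sqlite3


def predict_intensity(p, alpha, nu):
    """Power-law model I = p * nu**alpha (hypothetical helper name)."""
    return p * nu ** alpha


conn = sqlite3.connect(":memory:")

# Assumed layout: one row of fitted parameters per source.
conn.execute(
    "CREATE TABLE model_params (source INTEGER PRIMARY KEY, p REAL, alpha REAL)"
)
conn.execute("INSERT INTO model_params VALUES (?, ?, ?)", (17562, 0.812, -0.692))

# "Move to DB": expose the Python function as a SQL scalar function.
conn.create_function("predict_intensity", 3, predict_intensity)

# The query touches only the small parameter table, never the observations.
row = conn.execute(
    "SELECT predict_intensity(p, alpha, ?) FROM model_params WHERE source = ?",
    (0.14, 17562),
).fetchone()
estimate = row[0]
print(f"estimated intensity at 0.14 GHz: {estimate:.3f} Jy")
```

Because the estimate is computed from the stored parameters alone, a query of this kind does not need to read the measurement data at all, at the price of returning an approximation.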

slide-15
SLIDE 15

Approximate Answer with zero IO*

slide-16
SLIDE 16

But…

  • What do we do if model parameters are not specified in the query?
  • Sample data?
  • Given multiple parameters, it is far from certain that all combinations of values are allowed in the model.
  • Construct filter?
slide-17
SLIDE 17

         Flux          Flux Residuals   Ratio
ORIG     11,665,408    11,665,408       0%
GZIP     4,331,782     3,748,872        86%
BZIP2    3,341,574     2,752,044        82%
XZ       2,887,584     2,727,144        94%

“Semantic” Compression

Drop residuals = lossy compression =
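
The idea behind the table can be sketched as below. This is a minimal illustration, not the authors' experiment: the data are synthetic, the fixed-point scale SCALE, the noise level, and the helper name model_fixed are assumptions, and zlib stands in for the compressors listed. The flux values are stored as integers so that "model plus residuals" restores the original exactly.

```python
import array
import random
import zlib

random.seed(0)

P, ALPHA = 0.812, -0.692  # parameters from the plot slide
SCALE = 1_000_000         # assumed fixed-point scale (micro-Jy)


def model_fixed(nu):
    """Model prediction rounded onto the same integer grid as the data."""
    return round(P * nu ** ALPHA * SCALE)


# Synthetic observations: model value plus small Gaussian noise (made-up data).
freqs = [0.10 + 0.10 * i / 9999 for i in range(10_000)]
flux = [model_fixed(nu) + round(random.gauss(0, 0.01) * SCALE) for nu in freqs]

# Residuals are what the model fails to explain.
residuals = [obs - model_fixed(nu) for obs, nu in zip(flux, freqs)]

raw_size = len(zlib.compress(array.array("i", flux).tobytes()))
res_size = len(zlib.compress(array.array("i", residuals).tobytes()))
print(f"flux: {raw_size} bytes, residuals: {res_size} bytes")

# Lossless: model + residuals restores every value exactly.
restored = [model_fixed(nu) + r for nu, r in zip(freqs, residuals)]

# Lossy: drop the residuals and keep only the model prediction.
lossy = [model_fixed(nu) for nu in freqs]
max_error_jy = max(abs(a - b) for a, b in zip(flux, lossy)) / SCALE
print(f"largest error after dropping residuals: {max_error_jy:.4f} Jy")
```

With a well-fitting model the residuals are small numbers, so one would expect them to compress better than the raw values; how much better depends on the data and the compressor.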

slide-18
SLIDE 18

Data & Model Changes

  • What should we do if the user gives us a better model?

  • Recompressing could be very expensive
  • Threshold for improvement?
  • Changes in the data affect the model quality, too
  • Switch models?
  • Constant Monitoring?
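
One possible way to make the "threshold" and "monitoring" questions above concrete is sketched here. The function names, the threshold values min_gain and min_r2, and the sample numbers are all assumptions chosen for illustration; the slides leave the actual policy open.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def should_switch(observed, current_pred, candidate_pred, min_gain=0.02):
    """Switch models only if R^2 improves by at least min_gain (assumed threshold)."""
    gain = r_squared(observed, candidate_pred) - r_squared(observed, current_pred)
    return gain >= min_gain


def needs_refit(observed, predicted, min_r2=0.9):
    """Monitoring check: flag the model once its R^2 drops below min_r2 (assumed)."""
    return r_squared(observed, predicted) < min_r2


# Made-up numbers: observations and the predictions of two competing models.
observed = [3.9, 3.2, 2.8, 2.5]
current_pred = [3.5, 3.3, 3.0, 2.9]
candidate_pred = [3.95, 3.17, 2.78, 2.47]

print("switch to candidate:", should_switch(observed, current_pred, candidate_pred))
print("current model needs refit:", needs_refit(observed, current_pred))
```

A minimum gain keeps the system from paying the recompression cost for marginal improvements, and the same quality measure can be re-evaluated whenever new data arrive.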
slide-19
SLIDE 19

Multiple, partial or grouped

  • There could be many models for a table with overlapping parameters
  • Which one to pick?
  • Models do not have to cover the entire table/column
  • “Patching”?
  • Models could be fitted on aggregation results
  • Keep group counts?
slide-20
SLIDE 20

How do we get our hands on Models?

slide-21
SLIDE 21
slide-22
SLIDE 22

Integrate & Intercept

  • Integrate model fitting infrastructure into data management system.
  • Also: Huge performance benefits for analysts!
  • Intercept model fitting and validation operations by the user and store the model for later use.

  • Storage format: Model code + Parameters
slide-23
SLIDE 23

[Diagram with steps (1) to (5)]
User proposes a model: I ≈ p · ν^α ?
Data used for the fit: S, ν, I
Fit result: R² = 0.92 !
Stored with the model I ≈ p · ν^α: S, p, α
Query: S = 42, ν = 0.14, I = ?
Answer: I = 3.0 ± 0.05 !
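
The workflow on this slide can be sketched end to end as follows. This is a minimal sketch under assumptions: the fit is done by ordinary least squares in log space (ln I = ln p + α · ln ν), which is one common way to fit a power law and not necessarily what the authors used; the observations are made up from known parameters; and the names fit_power_law, predict, and params are hypothetical. It does not compute the uncertainty shown in the answer on the slide.

```python
import math


def fit_power_law(freqs, intensities):
    """Least-squares fit of ln I = ln p + alpha * ln nu; returns (p, alpha)."""
    xs = [math.log(nu) for nu in freqs]
    ys = [math.log(i) for i in intensities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    p = math.exp(mean_y - alpha * mean_x)
    return p, alpha


def predict(p, alpha, nu):
    """Evaluate the stored model I = p * nu**alpha."""
    return p * nu ** alpha


# Made-up, noise-free observations for one source, generated from known parameters.
freqs = [0.10, 0.12, 0.14, 0.16, 0.18, 0.20]
intensities = [0.812 * nu ** -0.692 for nu in freqs]

# Fit the model and keep only the parameters, keyed by source.
p, alpha = fit_power_law(freqs, intensities)
params = {17562: (p, alpha)}

# Answer "source = 17562, nu = 0.14, I = ?" from the stored parameters.
p_s, alpha_s = params[17562]
answer = predict(p_s, alpha_s, 0.14)
print(f"p={p:.3f}, alpha={alpha:.3f}, I(0.14 GHz)={answer:.3f} Jy")
```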

slide-24
SLIDE 24

Questions?

http://hannes.muehleisen.org @hfmuehleisen

George E. P. Box

“Essentially, all models are wrong, but some are useful.”