A Study on Workload-Aware Wavelet Synopses for Point and Range Sum Queries



SLIDE 1

A Study on Workload-Aware Wavelet Synopses for Point and Range Sum Queries

Michael Mathioudakis, mathiou@cs.toronto.edu
Dimitris Sacharidis, dsachar@dblab.ntua.gr
Timos Sellis, timos@dblab.ntua.gr

DOLAP 2006

SLIDE 2

Outline

  • Introduction
  • Wavelets
  • Error Metrics
  • Algorithms for Point Errors
  • Algorithms for Range Sum Errors
  • Experimental Results
SLIDE 3

Introduction

  • Approximate Query Processing over Synopses: an effective approach to managing large data sets (e.g. OLAP queries)
      1. Query optimization: synopses provide highly accurate query selectivity estimates
      2. Stand-in for the actual data: synopses provide quick approximate answers to large queries
  • Workload-Awareness: take user behavior into consideration, giving more accuracy to important data (workload-aware synopses)
  • Histograms, Wavelet Transformation: commonly used synopsis construction techniques

SLIDE 4

Introduction - Our Contribution

  • Focus on wavelet synopsis construction algorithms
  • Theoretical presentation of existing algorithms
  • Presentation of a novel workload-aware algorithm for range-sum queries
  • Experimental study - Accuracy vs Time Efficiency
SLIDE 5

Outline

  • Introduction
  • Wavelets
  • Error Metrics
  • Algorithms for Point Errors
  • Algorithms for Range Sum Errors
  • Experimental Results
SLIDE 6

Wavelet Preliminaries

  • It’s a transformation!
  • Histograms: construct buckets on the initial data and assign one value per bucket

[Figure: initial data a1 ... a8 grouped into Bucket 1, Bucket 2, and Bucket 3]

[Figure: initial data a1 ... a8 are transformed into wavelet coefficients c1 ... c8; keep only B << N of them]

SLIDE 7

Wavelet Preliminaries

Haar W/T: recursive pairwise calculation of averages and semi-differences (details)

[Figure: Haar decomposition of the example data, showing pairwise averages (2, 1, 4, 4, then 3/2, 4, then 11/4) and pairwise details (1/2, -1, -1, -5/4)]

  • 11/4 = (3/2 + 4)/2
  • -5/4 = (3/2 - 4)/2
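
The recursive averaging and semi-differencing can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function name `haar_transform` and the eight-value input array are assumptions. The input was chosen because it reproduces the averages and details shown on the slide (for example 11/4 and -5/4).

```python
from fractions import Fraction

def haar_transform(data):
    """Return [overall average, detail coefficients from coarsest to finest]."""
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    level = [Fraction(x) for x in data]
    details = []
    while len(level) > 1:
        pairs = list(zip(level[0::2], level[1::2]))
        # semi-difference (detail) of each pair; finer levels end up last
        details = [(a - b) / 2 for a, b in pairs] + details
        # pairwise averages feed the next, coarser level
        level = [(a + b) / 2 for a, b in pairs]
    return level + details

# Assumed example data, consistent with 11/4 = (3/2 + 4)/2 on the slide
coeffs = haar_transform([2, 2, 0, 2, 3, 5, 4, 4])
print(", ".join(str(c) for c in coeffs))
# prints: 11/4, -5/4, 1/2, 0, 0, -1, -1, 0
```

Exact fractions are used so that the printed coefficients match the slide's notation rather than floating-point approximations.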

SLIDE 8

Wavelet Preliminaries

[Figure: Haar error tree with coefficients 11/4, -5/4, 1/2, -1, -1; reconstructing one value needs O(log N) coefficients]

  • Initial values can be reconstructed in logarithmic time
  • Nearby data with similar values yield small details
  • Coefficients near the root are more important, so normalization is needed
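
Logarithmic-time reconstruction can be illustrated as follows. The sketch assumes the coefficients are stored in heap order (index 0 is the overall average, index 1 the root detail, node j has children 2j and 2j+1). That layout, the name `reconstruct_point`, and the example coefficients are assumptions made for illustration.

```python
from fractions import Fraction as F

def reconstruct_point(coeffs, i):
    """Rebuild data value i from the coefficients on its root-to-leaf path."""
    n = len(coeffs)
    value = coeffs[0]            # overall average
    node = n + i                 # position of leaf i in heap numbering
    while node > 1:
        parent = node // 2
        # a left (even) child adds the parent's detail, a right (odd) child subtracts it
        value += coeffs[parent] if node % 2 == 0 else -coeffs[parent]
        node = parent
    return value

# Assumed example coefficients (Haar transform of 2, 2, 0, 2, 3, 5, 4, 4)
coeffs = [F(11, 4), F(-5, 4), F(1, 2), 0, 0, -1, -1, 0]
values = [reconstruct_point(coeffs, i) for i in range(8)]
print([int(v) for v in values])
# prints: [2, 2, 0, 2, 3, 5, 4, 4]
```

Each call touches the average plus one detail per tree level, which is 1 + log2(N) coefficients.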

SLIDE 9

Wavelet Synopses

Keep B coefficients; dropped coefficients are treated as zero

  • Error is introduced into the values of our data

[Figure: error tree and the values reconstructed from the synopsis (2 4 2 1 1 4 4 4); example Point Error = 1, Range Sum Error = 1]
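
A small sketch of how dropping coefficients introduces point and range-sum errors. The data array, the choice of kept coefficients (B = 3), and the helper name `inverse_haar` are assumptions for illustration. With these choices the reconstructed values coincide with the ones in the figure.

```python
from fractions import Fraction as F

def inverse_haar(coeffs):
    """Invert the Haar transform: coefficients -> data values."""
    level, pos = [coeffs[0]], 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(level)]
        pos += len(level)
        # each average/detail pair expands into two values of the finer level
        level = [v for avg, det in zip(level, details) for v in (avg + det, avg - det)]
    return level

data = [2, 2, 0, 2, 3, 5, 4, 4]                       # assumed example data
coeffs = [F(11, 4), F(-5, 4), F(1, 2), 0, 0, -1, -1, 0]
kept = {0, 1, 2}                                       # B = 3 retained coefficients
synopsis = [c if j in kept else 0 for j, c in enumerate(coeffs)]
approx = inverse_haar(synopsis)
errors = [a - d for a, d in zip(approx, data)]         # point errors
range_error = sum(errors[2:5])                         # error of the sum over positions 2..4

print([int(a) for a in approx])   # prints: [2, 2, 1, 1, 4, 4, 4, 4]
print([int(e) for e in errors])   # prints: [0, 0, 1, -1, 1, -1, 0, 0]
print(int(range_error))           # prints: 1
```

The range-sum error is simply the sum of the point errors inside the range, so positive and negative point errors can cancel.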
SLIDE 10

Outline

  • Introduction
  • Wavelets
  • Error Metrics
  • Algorithms for Point Errors
  • Algorithms for Range Sum Errors
  • Experimental Results
SLIDE 11

Error Metrics

[Figure: example rows of initial values, values after the synopsis, and the resulting point errors; Range Sum Error(2:5) = 4]

  • Weighted Error Metrics
  • For point queries: $L^w_p = \sum_i w[i]\, e[i]^p$
  • For range sum queries: $L^w_p = \sum_{i \le j} w[i,j]\, e[i:j]^p$
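
The two weighted metrics can be computed as sketched below. Several details are assumptions rather than statements from the slides: errors enter through their absolute value (so that odd p behaves sensibly), e[i:j] is taken as the sum of point errors over positions i..j inclusive, indices are 0-based, and the function names and sample weights are hypothetical.

```python
def weighted_point_error(weights, errors, p):
    """Sum over i of w[i] * |e[i]|^p."""
    return sum(w * abs(e) ** p for w, e in zip(weights, errors))

def weighted_range_error(range_weights, errors, p):
    """Sum over queried ranges (i, j), i <= j, of w[i, j] * |e[i:j]|^p."""
    prefix = [0]
    for e in errors:
        prefix.append(prefix[-1] + e)        # prefix[k] = e[0] + ... + e[k-1]
    return sum(w * abs(prefix[j + 1] - prefix[i]) ** p
               for (i, j), w in range_weights.items())

errors = [0, 0, 1, -1, 1, -1, 0, 0]           # assumed point errors of some synopsis
print(weighted_point_error([1] * 8, errors, 2))                     # prints: 4
print(weighted_range_error({(2, 4): 3, (2, 3): 5}, errors, 2))      # prints: 3
```

In the second call the range (2, 3) contributes nothing because its two point errors cancel, even though it carries the larger weight.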
SLIDE 12

Outline

  • Introduction
  • Wavelets
  • Error Metrics
  • Algorithms for Point Errors
  • Algorithms for Range Sum Errors
  • Experimental Results
SLIDE 13

Classic Algorithm

  • Minimizes L2 of point errors
  • Selects the B largest normalized coeffs, using a heap
  • Complexity: O(N) space, O(N + B log N) time

[Figure: Haar error tree of the example data]
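
A minimal sketch of the greedy selection with a heap. It assumes heap-ordered coefficients (index 0 is the overall average) and normalizes each coefficient by the square root of the number of data values it influences, which is the usual way to make unnormalized Haar coefficients comparable under L2. The function name and example values are illustrative only.

```python
import heapq
import math

def classic_synopsis(coeffs, budget):
    """Indices of the `budget` coefficients with the largest normalized magnitude."""
    n = len(coeffs)
    heap = []
    for j, c in enumerate(coeffs):
        level = 0 if j == 0 else j.bit_length() - 1
        support = n >> level               # number of data values c influences
        # negate because heapq is a min-heap and the largest should pop first
        heap.append((-abs(c) * math.sqrt(support), j))
    heapq.heapify(heap)                    # O(N)
    kept = [heapq.heappop(heap)[1] for _ in range(min(budget, n))]   # O(B log N)
    return sorted(kept)

coeffs = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]   # assumed example coefficients
print(classic_synopsis(coeffs, 4))
# prints: [0, 1, 5, 6]
```

Note that the coefficient 0.5 at index 2 loses to the two finer coefficients of magnitude 1 even after normalization, since 0.5 * sqrt(4) = 1 is smaller than 1 * sqrt(2).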

SLIDE 14

Garofalakis - Kumar

  • Minimizes Weighted Error Metrics
  • Dynamic Programming Algorithm on transformation’s tree
  • Complexity: O(N^2) space, O(N^2 log B) time

[Figure: error tree with weights on the data; some coefficients are already kept, and the B coefficients still available are split into K and B-K]
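
The general idea of dynamic programming over the error tree can be sketched as below. This is not the published Garofalakis-Kumar algorithm, only an illustration under stated assumptions: retained coefficients keep their Haar values, the metric is a weighted sum of |error|^p, the state is (node, remaining budget, value contributed by the kept ancestors), and all names are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache

def haar(data):
    level, details = [Fraction(x) for x in data], []
    while len(level) > 1:
        pairs = list(zip(level[0::2], level[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details
        level = [(a + b) / 2 for a, b in pairs]
    return level + details          # heap order: average, then details

def optimal_weighted_error(data, weights, budget, p=2):
    """Minimum weighted point error using at most `budget` Haar coefficients."""
    n, coeffs = len(data), haar(data)

    @lru_cache(maxsize=None)
    def best(node, b, incoming):
        # incoming: value that the kept ancestors of `node` contribute
        if node >= n:               # leaf for data item node - n
            i = node - n
            return weights[i] * abs(data[i] - incoming) ** p
        options = []
        for keep in (0, 1):         # drop or keep this node's coefficient
            if keep > b:
                continue
            c = coeffs[node] if keep else 0
            for left_b in range(b - keep + 1):   # split the rest: K and B-K
                options.append(best(2 * node, left_b, incoming + c)
                               + best(2 * node + 1, b - keep - left_b, incoming - c))
        return min(options)

    # the overall average is kept or dropped before descending from the root
    candidates = [best(1, budget, 0)]
    if budget >= 1:
        candidates.append(best(1, budget - 1, coeffs[0]))
    return min(candidates)

data = [2, 2, 0, 2, 3, 5, 4, 4]          # assumed example data
print(optimal_weighted_error(data, [1] * 8, 1))                      # prints: 35/2
print(optimal_weighted_error(data, [1, 1, 8, 1, 1, 1, 1, 1], 3))
```

With uniform weights and p = 2 the result agrees with the greedy classic choice, which is a useful sanity check. The second call uses a hypothetical workload that stresses the third data value.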

SLIDE 15

Matias-Urieli

  • Minimizes the weighted L2 norm ($L^w_2$) of point errors
  • Uses a modified Haar wavelet transformation (weighted averages and weighted differences, with weights w1 and w2), then applies the classic algorithm
  • Complexity: O(N) space, O(N + B log N) time
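
One way to see why a modified transform followed by the classic selection minimizes weighted L2: if the basis is orthonormal under the weighted inner product sum of w[i] f[i] g[i], the weighted squared error equals the sum of the squared dropped coefficients. The sketch below builds such a Haar-like basis explicitly. It is an assumption-laden illustration (positive weights, hypothetical names, O(N^2) memory for clarity) and may differ from the exact construction used by Matias and Urieli.

```python
import math

def weighted_haar(data, weights):
    """(coefficient, basis vector) pairs for a basis orthonormal under the weighted inner product."""
    n, total = len(data), sum(weights)
    basis = [[1 / math.sqrt(total)] * n]          # scaling function
    size = n
    while size > 1:                                # coarse to fine
        for lo in range(0, n, size):
            mid, hi = lo + size // 2, lo + size
            wl, wr = sum(weights[lo:mid]), sum(weights[mid:hi])
            a = math.sqrt(wr / (wl * (wl + wr)))   # value on the left half
            b = math.sqrt(wl / (wr * (wl + wr)))   # magnitude on the right half
            vec = [0.0] * n
            vec[lo:mid] = [a] * (mid - lo)
            vec[mid:hi] = [-b] * (hi - mid)
            basis.append(vec)
        size //= 2
    coeffs = [sum(w * d * v for w, d, v in zip(weights, data, vec)) for vec in basis]
    return list(zip(coeffs, basis))

def weighted_synopsis(data, weights, budget):
    """Keep the `budget` largest coefficients; return (approximation, predicted weighted L2^2)."""
    pairs = sorted(weighted_haar(data, weights), key=lambda cv: -abs(cv[0]))
    kept, dropped = pairs[:budget], pairs[budget:]
    approx = [sum(c * vec[i] for c, vec in kept) for i in range(len(data))]
    return approx, sum(c * c for c, _ in dropped)

data = [2, 2, 0, 2, 3, 5, 4, 4]                    # assumed example data
w = [1, 1, 8, 1, 1, 1, 1, 1]                       # hypothetical point workload
approx, predicted = weighted_synopsis(data, w, 3)
actual = sum(wi * (d - a) ** 2 for wi, d, a in zip(w, data, approx))
print(round(actual, 6), round(predicted, 6))       # the two numbers agree
```

A linear-space implementation would compute the coefficients bottom-up instead of materializing the basis vectors.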
SLIDE 16

Outline

  • Introduction
  • Wavelets
  • Error Metrics
  • Algorithms for Point Errors
  • Algorithms for Range Sum Errors
  • Experimental Results
SLIDE 17

Matias - Urieli

[Figure: raw data are turned into prefix sums, the Haar transformation is applied to the prefix sums, and the largest B coefficients are picked greedily]

  • Minimizes L2 - Complexity: O(N) space, O(N + B log N) time
  • Working with prefix sums has disadvantages: sparse data become dense, and updates are difficult
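
A short sketch of why prefix sums suit range-sum queries and why they densify sparse data. The array and function names are illustrative assumptions.

```python
from itertools import accumulate

def prefix_sums(data):
    """prefix[k] = data[0] + ... + data[k]."""
    return list(accumulate(data))

def range_sum(prefix, i, j):
    """Sum of data[i..j] (inclusive) from two prefix-sum lookups."""
    return prefix[j] - (prefix[i - 1] if i > 0 else 0)

sparse = [0, 0, 7, 0, 0, 0, 0, 0]     # a single nonzero value
dense = prefix_sums(sparse)
print(dense)                          # prints: [0, 0, 7, 7, 7, 7, 7, 7]
print(range_sum(dense, 1, 3))         # prints: 7
```

One nonzero raw value produces six nonzero prefix sums here, and changing that value would change every prefix sum after it, which is the update problem the slide mentions.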

SLIDE 18

RangeWave

[Figure: raw data with the dyadic ranges hierarchy above it, representing the range-sum query workload]

  • Minimizes weighted Lp of range-sum queries that follow a dyadic hierarchy
  • Workload-aware; applies to the raw data
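
The dyadic hierarchy of ranges can be enumerated as in the following sketch. The function name, the 0-based inclusive ranges, and the sample workload are assumptions for illustration.

```python
def dyadic_ranges(n):
    """All dyadic ranges (lo, hi), inclusive, over n = 2^k positions, coarse to fine."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    ranges, size = [], n
    while size >= 1:
        ranges += [(lo, lo + size - 1) for lo in range(0, n, size)]
        size //= 2
    return ranges

ranges = dyadic_ranges(8)
# Hypothetical workload: weight proportional to range length
workload = {(lo, hi): hi - lo + 1 for lo, hi in ranges}
print(len(ranges))        # prints: 15
print(ranges[:3])         # prints: [(0, 7), (0, 3), (4, 7)]
```

There are 2N - 1 dyadic ranges in total, one per node of the error tree plus one per data value.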
SLIDE 19

RangeWave

  • A Dynamic Programming Algorithm
  • Complexity: O(N^2 log B) time, O(N^2) space

[Figure: node i with weight W[i] above the raw data; given the already kept coefficients, the B available coefficients are split into K and B-K, and the error is computed for the corresponding dyadic interval]
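
A property of the Haar tree that a dynamic program over dyadic ranges can exploit: the sum over the dyadic interval under a node depends only on the overall average and the coefficients on the path above that node, because the details at and below the node cancel within the interval. The sketch below illustrates this property only. It is not the RangeWave algorithm, and the names, heap-order layout, and example values are assumptions.

```python
from fractions import Fraction as F

def dyadic_sum_from_ancestors(coeffs, kept, node):
    """Approximate sum over the dyadic range under error-tree `node`,
    using only kept coefficients on the path above that node."""
    n = len(coeffs)
    size = n >> (node.bit_length() - 1)      # number of data values under `node`
    avg = coeffs[0] if 0 in kept else 0
    child = node
    while child > 1:
        parent = child // 2
        if parent in kept:
            # left (even) children add the parent's detail, right (odd) ones subtract it
            avg += coeffs[parent] if child % 2 == 0 else -coeffs[parent]
        child = parent
    return size * avg

coeffs = [F(11, 4), F(-5, 4), F(1, 2), 0, 0, -1, -1, 0]   # assumed example coefficients
kept = {0, 1, 2}                                           # synopsis with B = 3
print(dyadic_sum_from_ancestors(coeffs, kept, 2))   # range 0..3, prints: 6
print(dyadic_sum_from_ancestors(coeffs, kept, 5))   # range 2..3, prints: 2
```

For the assumed data 2, 2, 0, 2, 3, 5, 4, 4 the exact sums are 6 and 2, so both dyadic ranges are answered without error even though individual values in them are not.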
SLIDE 20

Outline

  • Introduction
  • Wavelets
  • Error Metrics
  • Algorithms for Point Errors
  • Algorithms for Range Sum Errors
  • Experimental Results
SLIDE 21

Algorithms Summary

Point Query Workload

Algorithm               Time          Space   Optimal
Matias - Urieli         N + B log N   N       Yes
Garofalakis - Kumar     N^2 log B     N^2     Yes
Classic Wavelets        N + B log N   N       No
Classic Histograms      N^2 B         N B     Yes

Dyadic Range Sum Query Workload

Algorithm               Time          Space   Optimal
RangeWave               N^2 log B     N^2     Yes
Koudas - Muthukrishnan  N^7 B^2       N^5 B   Yes
Matias - Urieli         N + B log N   N       Only for uniform workload
Classic                 N + B log N   N       No

SLIDE 22

Experimental Study

Point-Query Workloads

  • Data and point workload follow a Zipfian distribution
  • Increasing synopsis size
  • Matias-Urieli provides the best trade-off between accuracy (weighted L2 error) and running time

SLIDE 23

Experimental Study

Unbiased Dyadic Range Sum Query Workload

  • RangeWave exhibits significant accuracy gains as the synopsis size increases for this workload
  • Classic still performs well
SLIDE 24

Experimental Study

Biased Dyadic Range Sum Query Workload

  • Biased workload: assigns more significance to larger range-sum queries
  • The accuracy of RangeWave is orders of magnitude higher
SLIDE 25

Conclusions

  • Point Query Workloads: You Get What You Pay For
      Quadratic algorithms outperform linear ones in accuracy, at a high price
  • Range Sum Query Workloads: We can do better
      Find a linear time algorithm for all Range Sum Queries
      Extend RangeWave to a general hierarchy of queries

SLIDE 26

Thank You