Request Window: an Approach to Improve Throughput of RDBMS-based Data Integration System by Utilizing Data Sharing Across Concurrent Distributed Queries

SLIDE 1

Request Window:

an Approach to Improve Throughput of RDBMS-based Data Integration System by Utilizing Data Sharing Across Concurrent Distributed Queries

Rubao Lee, Minghong Zhou, Huaming Liao lirubao@software.ict.ac.cn Institute of Computing Technology Chinese Academy of Sciences

VLDB 2007

SLIDE 2

Outline

  • Motivation: To Improve DQP Throughput
  • Solution: Request Window
  • Evaluation: Distributed TPC-H Queries
  • Classification: Data Sharing Mechanisms
  • Conclusion: Summary and Future Work

SLIDE 3

Typical Data Integration Service

Middleware? Virtual DBMS? InfoGrid?

SLIDE 4

From DBMS to Data Integration System

The Key: Distributed Query Processing (DQP)

TableScan: interacting with storage devices
– SeqScan, IndexScan, BitmapScan

RemoteScan: interacting with data source wrappers
– Issuing data requests / fetching results

[Figure: query plan tree in which RemoteScan is the new leaf node; each RemoteScan leaf node talks to a data source wrapper, which talks to a data source]

IBM DB2 Information Integrator / MS SQL Server 2005 / IGNITE

SLIDE 5
  • Existing approaches only consider how to execute a single query faster
– Distributed Query Optimizer
– New Join Algorithms
– Adaptive Query Processing

GOAL: Increase Overall DQP Throughput

How to execute multiple concurrent queries more efficiently?

The key problem

SLIDE 6

Utilizing data sharing across concurrent queries to hide unnecessary I/O operations

Two factors of Distributed Query Processing

Network Speed, Source Burden

  • Reducing unnecessary network transfers
  • Reducing burdens of data sources

Data Sharing Is Important for DQP

SLIDE 7
  • DBMS’s query execution model:
– One connection, one process
– Execute each query in an independent process
– Use a global buffer pool manager
  • Foundation: Memory-Disk hierarchy
– Concurrent query processes can share disk pages!
– Page Replacement Algorithm (LRU, ARC, 2Q, LIRS, …)

Data Sharing inside DBMS
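
To illustrate how a global buffer pool lets concurrent query processes share disk pages, here is a minimal Python sketch of an LRU page cache. It is not the buffer manager of any particular DBMS: `BufferPool`, `get_page`, and `read_page_from_disk` are hypothetical names, and LRU is only one of the replacement algorithms listed above.

```python
from collections import OrderedDict


class BufferPool:
    """A tiny LRU page cache shared by all query processes (illustrative only)."""

    def __init__(self, capacity, read_page_from_disk):
        self.capacity = capacity
        self.read_page_from_disk = read_page_from_disk  # hypothetical disk reader
        self.pages = OrderedDict()  # page_id -> page contents, least recently used first
        self.disk_reads = 0

    def get_page(self, page_id):
        if page_id in self.pages:
            # Hit: some query already loaded this page; mark it most recently used.
            self.pages.move_to_end(page_id)
            return self.pages[page_id]
        # Miss: read from disk, then evict the least recently used page if over capacity.
        self.disk_reads += 1
        page = self.read_page_from_disk(page_id)
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)
        return page


pool = BufferPool(capacity=2, read_page_from_disk=lambda pid: f"page-{pid}")
query_a = [pool.get_page(p) for p in (1, 2)]  # query A scans pages 1 and 2
query_b = [pool.get_page(p) for p in (1, 2)]  # query B finds both pages cached
```

In this sketch query B triggers no additional disk reads, because query A already brought both pages into the shared pool.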

SLIDE 8
  • DQP inherits the underlying execution model
– Independently executing each distributed query
  • But there is no buffer pool manager available
  • Data sources do not support random access!
– Issue a SQL statement and fetch a result set (DBMS)
– Issue an HTTP request and get a response (web page)
– Issue a SOAP message and get a SOAP message back (SOA)

But, No Mem/Disk Hierarchy for DQP

SLIDE 9

No data sharing for DQP

  • Each query execution process has to interact with data sources independently!
– Redundant data requests issued to data sources
– Redundant result data transferred over network

The total throughput is limited by network speed and computing power of sources!

SLIDE 10

Outline

  • Motivation: To Improve DQP Throughput
  • Solution: Request Window
  • Evaluation: Distributed TPC-H Queries
  • Classification: Data Sharing Mechanisms
  • Conclusion: Summary and Future Work

SLIDE 11
  • Request Window: a batch-processing approach

– Combining multiple data requests and dispatching results

Overview of Request Window

SLIDE 12

Foundation of Request Window

  • Main idea: Decouple wrappers from query engine
– A wrapper runs in an independent process
– Use IPC to connect wrappers and query engine
  • Two phases, following the iterator model
– Start: engine sends data request to wrapper (open)
– Fetch: engine fetches result tuples from wrapper (next)

Start-Fetch Wrapper
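
The two phases can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the wrapper is an ordinary object rather than a separate process reached over IPC, the data source is an in-memory list, and the names `StartFetchWrapper`, `RemoteScan`, `start`, and `fetch` are hypothetical rather than taken from IGNITE.

```python
class StartFetchWrapper:
    """Stands in for a wrapper process; a real one would sit behind IPC."""

    def __init__(self, source_rows):
        self.source_rows = source_rows  # hypothetical in-memory "data source"
        self.cursor = None

    def start(self, predicate):
        # Start phase: accept the data request and prepare the result stream.
        self.cursor = (row for row in self.source_rows if predicate(row))

    def fetch(self):
        # Fetch phase: return one result tuple, or None when the results are exhausted.
        return next(self.cursor, None)


class RemoteScan:
    """Leaf node of a query plan that follows the iterator model."""

    def __init__(self, wrapper, predicate):
        self.wrapper = wrapper
        self.predicate = predicate

    def open(self):
        self.wrapper.start(self.predicate)  # open -> Start

    def next(self):
        return self.wrapper.fetch()         # next -> Fetch


orders = [
    {"o_orderkey": 1, "o_totalprice": 50.0},
    {"o_orderkey": 2, "o_totalprice": 900.0},
]
scan = RemoteScan(StartFetchWrapper(orders), lambda r: r["o_totalprice"] > 100)
scan.open()
first = scan.next()   # the only matching order
second = scan.next()  # None: no more tuples
```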

SLIDE 13
  • Parallelized query execution:
– Wrappers can prefetch the next tuples while the query engine is consuming earlier ones.
  • The independent wrapper process can be a common place for multiple query engine processes.
– Like the global buffer pool manager in a DBMS!
– Data sharing among multiple query processes becomes possible!

Two Benefits of Start-Fetch
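
The first benefit, prefetching, can be sketched with a background thread and a bounded queue. This is a hedged illustration rather than the actual wrapper code: `PrefetchingWrapper`, `fetch_rows`, and `buffer_size` are hypothetical names, and a thread stands in for the independent wrapper process.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the result set


class PrefetchingWrapper:
    def __init__(self, fetch_rows, buffer_size=4):
        self.fetch_rows = fetch_rows  # hypothetical callable returning an iterable of tuples
        self.buffer = queue.Queue(maxsize=buffer_size)

    def start(self):
        # Start phase: begin pulling tuples in the background.
        threading.Thread(target=self._prefetch, daemon=True).start()

    def _prefetch(self):
        for row in self.fetch_rows():
            self.buffer.put(row)  # blocks while the buffer is full
        self.buffer.put(_END)

    def fetch(self):
        # Fetch phase: tuples are usually already waiting in the buffer.
        row = self.buffer.get()
        if row is _END:
            self.buffer.put(_END)  # keep the sentinel so later calls also return None
            return None
        return row


wrapper = PrefetchingWrapper(lambda: iter(range(10)))
wrapper.start()
rows = []
while (row := wrapper.fetch()) is not None:
    rows.append(row)
```

The bounded queue keeps the wrapper from running arbitrarily far ahead of the query engine while still overlapping fetching with consumption.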

SLIDE 14
  • Each data request is inserted into a corresponding waiting queue (a request window):
  • The data request is not issued immediately
  • At some point in time, the window is issued:
1: Combining all requests into a common request:
  • Select (columns) from a_table where (predicate);
  • Generating a synthesized where clause.
2: Sending the request to the data source and receiving the result set
3: Dispatching the result set to each participating query engine process

What’s A Request Window?

When is the window issued?
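
Steps 1 to 3 can be sketched in Python as below. The sketch makes several assumptions that are not stated on the slide: the synthesized where clause is a plain OR of the individual predicates, the common column list is the union of the requested columns, every request selects the columns its own predicate needs, and step 2 is simulated with hard-coded rows. `DataRequest`, `combine`, and `dispatch` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DataRequest:
    query_id: str
    columns: list
    where_sql: str                   # predicate as SQL text, sent to the data source
    matches: Callable[[dict], bool]  # the same predicate, evaluated locally for dispatch


def combine(table, requests):
    """Step 1: build one common request covering every request in the window."""
    columns = sorted({c for r in requests for c in r.columns})
    where = " OR ".join(f"({r.where_sql})" for r in requests)
    return f"SELECT {', '.join(columns)} FROM {table} WHERE {where}"


def dispatch(rows, requests):
    """Step 3: give each participating query only its own rows and columns."""
    return {
        r.query_id: [{c: row[c] for c in r.columns} for row in rows if r.matches(row)]
        for r in requests
    }


window = [
    DataRequest("q1", ["l_orderkey", "l_quantity"], "l_quantity < 10",
                lambda row: row["l_quantity"] < 10),
    DataRequest("q2", ["l_orderkey", "l_discount"], "l_discount > 0.05",
                lambda row: row["l_discount"] > 0.05),
]
sql = combine("lineitem", window)

# Step 2 is simulated: these rows stand in for the result set of `sql`.
rows = [
    {"l_orderkey": 1, "l_quantity": 5, "l_discount": 0.02},
    {"l_orderkey": 2, "l_quantity": 30, "l_discount": 0.08},
]
results = dispatch(rows, window)
```

Only one request reaches the data source, and each row crosses the network once even if several queries need it.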

SLIDE 15

Window Size

Window Size: the time from window creation to window issuing!

How to determine the window size?

A large window size: more data requests can be collected, but early requests have to wait! (unfair)

SLIDE 16
  • DIOP: Delay Indicated by OPtimizer

– Let the query optimizer indicate a tolerable delay time for each data request

  • DAW: Dynamically Adjusting Window

– Adjust the window size when a new data request arrives

To Determine Window Size

SLIDE 17

The iterator model: tuple fetching on demand!

DIOP: Why can a request be delayed?

[Figure: query plan for ORDERS ⋈ LINEITEM ⋈ PARTSUPP with Hash Join 1, Hash Join 2, Hash 1, Hash 2 and the leaf nodes R_Scan 1 (ORDERS), R_Scan 2 (PARTSUPP), R_Scan 3 (LINEITEM); a timeline over R_Scan 1, R_Scan 2, R_Scan 3 carries the annotations "I am waiting for Hash 1." and "I am waiting for Hash 2."]

The execution can be divided into several phases

SLIDE 18

Pipelined data fetch: have your tuples ready when I need them!

DIOP: How long can a request be delayed?

ID: Initial Delay
WO: Wait Opportunity
Maximized Delay Time of a request R generated by a leaf node N

SLIDE 19
  • “Wait Opportunity” of a node N
– For non-root node:
  • ARD: Algorithm Related Delay
  • Different relational operators have different ARDs
– Hashjoin/Mergejoin
– Union/intersection/difference

DIOP: Algorithm-Related-Delay

[Figure: timeline comparing "Fetch parent node’s tuples" with "Fetch this node’s tuples"]

SLIDE 20

DIOP: Estimation for Hash-Join Tree

[Figure: left-deep hash-join tree and right-deep hash-join tree]

Only consider time for data transfers over network

Wait opportunity? The time needed to finish all these requests
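
A rough Python sketch of such an estimate follows. It assumes, for illustration only, that the remote scans of a hash-join plan are consumed strictly one after another, so that a scan's wait opportunity is the total network transfer time of the scans before it. The result sizes and the 12.5 MB/s bandwidth are made-up numbers, and `wait_opportunities` is a hypothetical helper, not the optimizer's actual estimator.

```python
def transfer_time(result_bytes, bandwidth_bytes_per_s):
    """Time to ship one scan's result over the network, ignoring all other costs."""
    return result_bytes / bandwidth_bytes_per_s


def wait_opportunities(scans, bandwidth_bytes_per_s):
    """scans: list of (name, estimated_result_bytes) in the order the plan consumes them.

    Returns a dict mapping each scan to the number of seconds its request could wait.
    """
    waits = {}
    elapsed = 0.0
    for name, result_bytes in scans:
        waits[name] = elapsed  # everything consumed earlier must finish first
        elapsed += transfer_time(result_bytes, bandwidth_bytes_per_s)
    return waits


# Hypothetical result sizes for the three remote scans of the example plan.
scans = [
    ("R_Scan 1 (ORDERS)", 12_500_000),
    ("R_Scan 2 (PARTSUPP)", 12_500_000),
    ("R_Scan 3 (LINEITEM)", 75_000_000),
]
waits = wait_opportunities(scans, bandwidth_bytes_per_s=12_500_000.0)
```

Under these assumptions the request of R_Scan 3 could wait about two seconds, since the first two transfers take one second each.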

SLIDE 21
  • Remember the goal: to determine window size
  • DIOP is just the first step:
– Each data request has an annotation of its maximized delay time
  • A coordinator is required to determine the window size on the basis of the delay times of all participating requests
– Adjust window size when a new request arrives

DAW: Dynamically Adjust Window

SLIDE 22
  • A background worker thread (wakes up every 1 second)
– Resetting window size (if not ready)
– Issuing window (if timed out)
  • Window Adjusting Policy (when a new request arrives)
– Emergency-oriented policy
WS = MDT if MDT < WS
– Throughput-oriented policy (DSS Queries)

DAW: Mechanism and Policy

The window size will never be increased!

WS: Window Size
MDT: Maximized Delay Time of the new request
RC: number of requests in the current window
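
The emergency-oriented rule is small enough to sketch directly in Python. The sketch applies the rule literally (WS = MDT if MDT < WS) and assumes that a new window starts with WS equal to the MDT of its first request, which the slide does not state. The throughput-oriented policy is not sketched, because its formula is not given in the text. `RequestWindow`, `add_request`, and `timed_out` are hypothetical names, and the background thread is reduced to a `timed_out` check it would run on each wake-up.

```python
class RequestWindow:
    def __init__(self, first_mdt, created_at):
        self.created_at = created_at  # time at which the window was created
        self.window_size = first_mdt  # WS; assumed to start at the first request's MDT
        self.request_count = 1        # RC

    def add_request(self, mdt):
        """Emergency-oriented policy: WS = MDT if MDT < WS."""
        self.request_count += 1
        if mdt < self.window_size:
            self.window_size = mdt  # the window only ever shrinks

    def timed_out(self, now):
        """What the background thread would check on each wake-up."""
        return now - self.created_at >= self.window_size


window = RequestWindow(first_mdt=8.0, created_at=0.0)
window.add_request(mdt=5.0)  # a more urgent request shrinks the window
window.add_request(mdt=9.0)  # a less urgent request leaves it unchanged
```

After these calls the window size is 5.0 seconds, so a check at time 4.0 does not issue the window and a check at time 5.0 does.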

SLIDE 23

Outline

  • Motivation: To Improve DQP Throughput
  • Solution: Request Window
  • Evaluation: Distributed TPC-H Queries
  • Classification: Data Sharing Mechanisms
  • Conclusion: Summary and Future Work

SLIDE 24
  • IGNITE: on top of PostgreSQL
  • TPC-H: 100MB (scale 0.1)
  • IGNITE Machine:
– Intel P4 Xeon 2.4GHz x4, 2GB Mem, Linux 2.4.18 SMP
  • Data source Machines:
– Intel P4 2.8GHz, 512MB Mem, FreeBSD 5.4
– PostgreSQL
– Each TPC-H table is provided by a data source
  • 100M LAN

Experimental Setup

SLIDE 25

Improvement of Overall Throughput

[Figure: throughput (queries/hour) versus number of clients (2 to 12) for three systems: SIGMODXYZ with Request Window, SIGMODXYZ, and DBMS X]

Up to a 1.7x speedup

SLIDE 26

Outline

  • Motivation: To Improve DQP Throughput
  • Solution: Request Window
  • Evaluation: Distributed TPC-H Queries
  • Classification: Data Sharing Mechanisms
  • Conclusion: Summary and Future Work

SLIDE 27
  • Two Correlated Factors:
– Restriction on interarrival times (deadline for sharing)
– Amount of shared data (We can share data, but how much?)

Related Data Sharing Techniques

[Figure: techniques placed by restriction on interarrival times and amount of shared data: Request Window, Table Scan Piggybacking, Multi Query Optimization ("All queries must arrive simultaneously!"), and Page caching ("No restriction!")]

This is a rough comparison!

SLIDE 28
  • In a word: improve total throughput without sacrificing the response time of individual query execution
  • Request Window is suitable for running concurrent DSS queries

Discussions and Future Work

  • It is hard to estimate delay opportunities exactly
  • Add a Window Notification Mechanism
– Monitoring query execution progress
– Notifying the wrapper to issue the window
SLIDE 29

Thank You