SLIDE 1

Request Window:

an Approach to Improve Throughput of RDBMS-based Data Integration System by Utilizing Data Sharing Across Concurrent Distributed Queries

Rubao Lee, Minghong Zhou, Huaming Liao

lirubao@software.ict.ac.cn

Institute of Computing Technology, Chinese Academy of Sciences

VLDB 2007


SLIDE 2

Outline

Motivation: To Improve DQP Throughput
Solution: Request Window
Evaluation: Distributed TPC-H Queries
Classification: Data Sharing Mechanisms
Conclusion: Summary and Future Work


SLIDE 3

Typical Data Integration Service

Middleware? Virtual DBMS? InfoGrid?

SLIDE 4

From DBMS to Data Integration System

The Key: Distributed Query Processing (DQP)

TableScan

Interacting with storage devices
SeqScan, IndexScan, BitmapScan

RemoteScan

Interacting with data source wrappers
Issuing data requests / fetching results

Figure: RemoteScan as a new leaf node in the query plan tree; each RemoteScan leaf node reaches its data source through a data source wrapper.

IBM DB2 Information Integrator / MS SQL Server 2005 / IGNITE
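The contrast between the two kinds of leaf node can be sketched in a few lines of Python. This is only an illustration under assumptions: the class names `LeafNode` and `ListWrapper`, the `request` method, and the use of in-memory lists in place of disk pages and remote sources are all hypothetical and do not come from IGNITE or any of the products listed above.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Row = Tuple


class LeafNode:
    """Iterator-model leaf: open() starts the scan, next() yields one row or None."""

    def open(self) -> None:
        raise NotImplementedError

    def next(self) -> Optional[Row]:
        raise NotImplementedError


class TableScan(LeafNode):
    """Leaf that reads local storage (a list stands in for disk pages)."""

    def __init__(self, stored_rows: List[Row]) -> None:
        self._stored_rows = stored_rows
        self._it: Iterator[Row] = iter(())

    def open(self) -> None:
        self._it = iter(self._stored_rows)

    def next(self) -> Optional[Row]:
        return next(self._it, None)


class ListWrapper:
    """Hypothetical data source wrapper: accepts a request, returns result rows."""

    def __init__(self, source_rows: List[Row]) -> None:
        self._source_rows = source_rows

    def request(self, predicate: Callable[[Row], bool]) -> Iterator[Row]:
        return (row for row in self._source_rows if predicate(row))


class RemoteScan(LeafNode):
    """Leaf that issues a data request to a wrapper and fetches the results."""

    def __init__(self, wrapper: ListWrapper, predicate: Callable[[Row], bool]) -> None:
        self._wrapper = wrapper
        self._predicate = predicate
        self._it: Iterator[Row] = iter(())

    def open(self) -> None:
        self._it = self._wrapper.request(self._predicate)

    def next(self) -> Optional[Row]:
        return next(self._it, None)


def drain(node: LeafNode) -> List[Row]:
    """Pull every row from a leaf the way a parent operator would."""
    node.open()
    rows = []
    row = node.next()
    while row is not None:
        rows.append(row)
        row = node.next()
    return rows


rows = [(1, "a"), (2, "b"), (3, "c")]
local_rows = drain(TableScan(rows))
remote_rows = drain(RemoteScan(ListWrapper(rows), lambda r: r[0] >= 2))
```

Both leaves expose the same open/next interface, so the operators above them in the plan tree do not need to know whether rows come from storage or from a wrapper.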

SLIDE 5

Only consider how to execute a single query faster

– Distributed Query Optimizer
– New Join Algorithms
– Adaptive Query Processing

GOAL: Increase Overall DQP Throughput

The key problem: How to execute multiple concurrent queries more efficiently?

SLIDE 6

Data Sharing Is Important for DQP

Utilizing data sharing across concurrent queries to hide unnecessary I/O operations

Two factors of Distributed Query Processing: Network Speed, Source Burden

Reducing unnecessary network transfers
Reducing burdens of data sources

SLIDE 7

Data Sharing inside DBMS

DBMS’s query execution model:

– One connection, one process
– Execute each query in an independent process
– Use a global buffer pool manager

Foundation: Memory-Disk

– Concurrent query processes can share disk pages!
– Page Replacement Algorithm (LRU, ARC, 2Q, LIRS, …)
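As a rough illustration of why a global buffer pool lets concurrent queries share disk pages, the sketch below implements a tiny pool with LRU replacement. It is a single-process toy under assumptions: `BufferPool`, `get_page`, and the `read_page_from_disk` callback are hypothetical names, and a real DBMS would place the pool in shared memory with latching.

```python
from collections import OrderedDict


class BufferPool:
    """Toy global buffer pool with LRU page replacement."""

    def __init__(self, capacity, read_page_from_disk):
        self.capacity = capacity
        self._read = read_page_from_disk
        self._pages = OrderedDict()  # page_id -> page, least recently used first
        self.disk_reads = 0

    def get_page(self, page_id):
        if page_id in self._pages:
            # Hit: mark the page as most recently used, no disk access.
            self._pages.move_to_end(page_id)
            return self._pages[page_id]
        # Miss: read from disk and evict the least recently used page if full.
        self.disk_reads += 1
        page = self._read(page_id)
        self._pages[page_id] = page
        if len(self._pages) > self.capacity:
            self._pages.popitem(last=False)
        return page


def scan(pool, page_ids):
    """One query scanning a list of pages through the shared pool."""
    return [pool.get_page(page_id) for page_id in page_ids]


pool = BufferPool(capacity=4, read_page_from_disk=lambda pid: f"page-{pid}")
scan(pool, [0, 1, 2, 3])  # query A misses on every page
scan(pool, [0, 1, 2, 3])  # query B finds all four pages already cached
shared_reads = pool.disk_reads
```

The second scan causes no additional disk reads, which is the kind of sharing a distributed query engine lacks when it has no comparable layer between queries and data sources.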

SLIDE 8

But, No Mem/Disk Hierarchy for DQP

DQP inherits the underlying execution model

– Independently executing each distributed query

But, no available buffer pool manager
Data sources are not for random-access!

– Issue a SQL query and fetch a result set (DBMS)
– Issue an HTTP request and get a response (WebPage)
– Issue a SOAP message and get a SOAP message (SOA)

SLIDE 9

No data sharing for DQP

Each query execution process has to interact with data sources independently!

Redundant data requests issued to data sources
Redundant result data transferred over network
The total throughput is limited by network speed and computing power of sources!

SLIDE 10

Outline

Motivation: To Improve DQP Throughput
Solution: Request Window
Evaluation: Distributed TPC-H Queries
Classification: Data Sharing Mechanisms
Conclusion: Summary and Future Work


SLIDE 11

Overview of Request Window

Request Window: a batch-processing approach

– Combining multiple data requests and dispatching results

SLIDE 12

Start-Fetch Wrapper

Foundation of Request Window

Main idea: Decouple wrappers from query engine

– A wrapper is in an independent process
– Use IPC to connect wrappers and query engine

Two Phases: by iterator model

– Start: engine sends data request to wrapper (open)
– Fetch: engine fetches result tuples from wrapper (next)
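The two phases can be sketched as a small Python class. This is a minimal sketch under assumptions: a background thread and a bounded queue stand in for the independent wrapper process and the IPC channel, and the names `StartFetchWrapper`, `fetch_from_source`, and `prefetch_depth` are hypothetical.

```python
import queue
import threading

_END = object()  # sentinel marking the end of a result set


class StartFetchWrapper:
    """Start sends the data request; fetch returns one tuple at a time."""

    def __init__(self, fetch_from_source, prefetch_depth=16):
        self._fetch_from_source = fetch_from_source
        self._prefetch_depth = prefetch_depth
        self._buffer = None
        self._done = True

    def start(self, request):
        """Start phase (open): hand the request to the wrapper side."""
        buffer = queue.Queue(maxsize=self._prefetch_depth)
        self._buffer = buffer
        self._done = False

        def produce():
            # Runs concurrently with the engine and fills the bounded buffer.
            for row in self._fetch_from_source(request):
                buffer.put(row)
            buffer.put(_END)

        threading.Thread(target=produce, daemon=True).start()

    def fetch(self):
        """Fetch phase (next): return the next tuple, or None when exhausted."""
        if self._done:
            return None
        row = self._buffer.get()
        if row is _END:
            self._done = True
            return None
        return row


def squares_source(request):
    """Hypothetical data source: yields (i, i*i) for i in [lo, hi)."""
    lo, hi = request
    return ((i, i * i) for i in range(lo, hi))


wrapper = StartFetchWrapper(squares_source, prefetch_depth=2)
wrapper.start((0, 5))
fetched = []
row = wrapper.fetch()
while row is not None:
    fetched.append(row)
    row = wrapper.fetch()
```

Because the producer keeps filling the buffer while the engine processes earlier tuples, the wrapper side can run ahead of the consumer up to `prefetch_depth` tuples.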

SLIDE 13

Two Benefits of Start-Fetch

Parallelized query execution:

– Wrappers can prefetch the next tuples while the query engine is consuming old tuples.

The independent wrapper process can be a common place for multiple query engine processes.

– Like the global buffer pool manager in a DBMS!
– Data sharing across multiple query processes becomes possible!

SLIDE 14

What’s A Request Window?

Each data request will be inserted into a corresponding waiting queue (a request window):

The data request will not be issued immediately
At some point, the window will be issued (When?):

1: Combining all requests into a common request:

Select (columns) from a_table where (predicate);
Generating a synthesized where clause.

2: Sending the request to the data source and receiving the result set
3: Dispatching the result set to each participating query engine process
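The three steps can be sketched as follows. This is a minimal sketch under assumptions, not the method as implemented in the original system: it assumes the synthesized where clause is the disjunction (OR) of the individual predicates, that the combined column list also carries the columns needed to re-check each predicate at dispatch time, and that each request supplies a Python callable mirroring its SQL predicate. `DataRequest`, `combine`, and `dispatch` are hypothetical names; the column names follow the TPC-H ORDERS table.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


@dataclass
class DataRequest:
    query_id: str
    columns: Sequence[str]            # columns the query wants back
    predicate_columns: Sequence[str]  # columns its predicate reads
    where_sql: str                    # predicate as SQL text
    matches: Callable[[Row], bool]    # same predicate, evaluated locally


def combine(table: str, window: List[DataRequest]) -> str:
    """Step 1: merge all requests in the window into one common request."""
    columns = sorted({c for r in window for c in (*r.columns, *r.predicate_columns)})
    where = " OR ".join(f"({r.where_sql})" for r in window)
    return f"SELECT {', '.join(columns)} FROM {table} WHERE {where};"


def dispatch(window: List[DataRequest], resultset: List[Row]) -> Dict[str, List[Row]]:
    """Step 3: give each query only the rows and columns it asked for."""
    return {
        r.query_id: [{c: row[c] for c in r.columns} for row in resultset if r.matches(row)]
        for r in window
    }


window = [
    DataRequest("q1", ["o_orderkey"], ["o_totalprice"],
                "o_totalprice > 100", lambda row: row["o_totalprice"] > 100),
    DataRequest("q2", ["o_orderkey", "o_orderdate"], ["o_orderdate"],
                "o_orderdate >= '1995-01-01'", lambda row: row["o_orderdate"] >= "1995-01-01"),
]
sql = combine("orders", window)

# Step 2 (sending the request and receiving rows) is faked with a literal result set.
resultset = [
    {"o_orderkey": 1, "o_orderdate": "1994-06-01", "o_totalprice": 150},
    {"o_orderkey": 2, "o_orderdate": "1996-02-10", "o_totalprice": 50},
    {"o_orderkey": 3, "o_orderdate": "1997-03-05", "o_totalprice": 300},
]
per_query = dispatch(window, resultset)
```

The data source sees a single request instead of two, and the row with `o_orderkey` 3 crosses the network once even though both queries need it.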

SLIDE 15

Window Size

Window Size: from window-creating to window-issuing!

How to determine the window size?

A large window size: More data requests can be collected. But, early requests have to wait! (unfair)

SLIDE 16

To Determine Window Size

DIOP: Delay Indicated by OPtimizer

– Let the query optimizer indicate a tolerable delay time for each data request

DAW: Dynamically Adjusting Window

– Adjust the window size when a new data request arrives

SLIDE 17

DIOP: Why a request can be delayed?

The iterator model: tuple fetching on demand!

Figure: plan tree for ORDERS ⋈ LINEITEM ⋈ PARTSUPP with Hash Join 1, Hash Join 2, Hash 1, Hash 2 and three remote scans: R_Scan 1 (ORDERS), R_Scan 2 (PARTSUPP), R_Scan 3 (LINEITEM). A timeline shows R_Scan 1, R_Scan 2 and R_Scan 3, with the callouts "I am waiting for Hash 1." and "I am waiting for Hash 2."

The execution can be divided into several phases

SLIDE 18

DIOP: How long a request can be delayed?

Pipelined data fetch: Be ready for your tuples when I need them!

Maximized Delay Time of a request R generated by a leaf node N

ID: Initial Delay
WO: Wait Opportunity

SLIDE 19

DIOP: Algorithm-Related-Delay

“Wait Opportunity” of a node N

– For non-root node:

ARD: Algorithm Related Delay
Different relational operators have different ARDs

– Hashjoin/Mergejoin
– Union/intersection/difference

Figure: timeline with the phases "Fetch parent node’s tuples" and "Fetch this node’s tuples".

SLIDE 20

DIOP: Estimation for Hash-Join Tree

Left deep hash-join tree
Right deep hash-join tree

Only consider time for data transfers over network

Wait opportunity? The time to finish all these requests
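A rough way to picture this estimate is sketched below. It is a simplified sketch under stated assumptions and not the estimation formula from the talk: it assumes the remote scans of a hash-join tree are consumed strictly one after another, that only network transfer time counts, and that a scan's wait opportunity is the time needed to finish all scans consumed before it. The function names, the result sizes, and the 12.5 MB/s bandwidth (one reading of a 100 Mbit/s link) are all hypothetical.

```python
def transfer_seconds(result_bytes, bandwidth_bytes_per_s):
    """Time to move one result set over the network, ignoring everything else."""
    return result_bytes / bandwidth_bytes_per_s


def wait_opportunities(scans, bandwidth_bytes_per_s):
    """Map each scan to the time spent finishing the scans consumed before it.

    scans: list of (name, result_bytes) in the order the plan consumes them.
    """
    waits = {}
    elapsed = 0.0
    for name, result_bytes in scans:
        waits[name] = elapsed
        elapsed += transfer_seconds(result_bytes, bandwidth_bytes_per_s)
    return waits


# Hypothetical result sizes for the three remote scans of the example plan.
scans = [
    ("ORDERS", 25_000_000),
    ("PARTSUPP", 12_500_000),
    ("LINEITEM", 75_000_000),
]
waits = wait_opportunities(scans, bandwidth_bytes_per_s=12_500_000)
```

Under these assumptions the first scan cannot be delayed at all, while the last scan could wait about three seconds without slowing its own query, which is the slack a request window can use to collect more requests.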

SLIDE 21

DAW: Dynamically Adjust Window

Remember the goal: to determine window size
DIOP is just the first step:

– Each data request has an annotation of its maximized delay time

A coordinator is required to determine the window size on the basis of delay times of all participating requests

Adjust window size when a new request arrives

SLIDE 22

DAW: Mechanism and Policy

A background working thread (wake-up interval: 1 second)

– Resetting window size (if not ready)
– Issuing window (if time out)

Window Adjusting Policy (when a new request arrives)

– Emergency-oriented policy

WS = MDT if MDT < WS

– Throughput-oriented policy (DSS Queries)

The window size will never be increased!

WS: Window Size
MDT: Maximized Delay Time of the new request
RC: number of requests in the current window
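The emergency-oriented policy and the background thread's two duties can be sketched as below. This is a minimal sketch under assumptions: WS is treated as the time remaining until the window is issued, the first request's MDT initializes it, and `tick` stands in for one wake-up of the background thread. The class and method names are hypothetical, and the throughput-oriented policy is left out because its formula is not given in the text.

```python
class RequestWindow:
    """Toy request window driven by the emergency-oriented policy."""

    def __init__(self, first_mdt):
        self.ws = first_mdt  # WS: remaining window size in seconds (assumption)
        self.rc = 1          # RC: number of requests in the current window
        self.issued = False

    def add_request(self, mdt):
        """A new request arrives: WS = MDT if MDT < WS, so WS never grows."""
        if mdt < self.ws:
            self.ws = mdt
        self.rc += 1

    def tick(self, elapsed=1.0):
        """One wake-up of the background thread: reset WS, or issue on timeout."""
        if self.issued:
            return True
        self.ws -= elapsed
        if self.ws <= 0:
            self.issued = True
        return self.issued


window = RequestWindow(first_mdt=10.0)
window.tick()             # 1 s passes, 9 s remain
window.add_request(20.0)  # more tolerant request: WS stays at 9 s
window.add_request(3.0)   # urgent request: WS shrinks to 3 s
history = [window.tick() for _ in range(3)]
```

The urgent third request pulls the issue time forward, so no participating request waits longer than its own maximized delay time.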

SLIDE 23

Outline

Motivation: To Improve DQP Throughput
Solution: Request Window
Evaluation: Distributed TPC-H Queries
Classification: Data Sharing Mechanisms
Conclusion: Summary and Future Work


SLIDE 24

Experimental Setup

IGNITE: on top of PostgreSQL
TPC-H: 100MB (scale 0.1)
IGNITE Machine:

– Intel P4 Xeon 2.4GHz x4, 2GB Mem, Linux 2.4.18 SMP

Data source Machines:

– Intel P4 2.8GHz, 512MB Mem, FreeBSD 5.4
– PostgreSQL
– Each TPC-H table is provided by a data source

100M LAN

SLIDE 25

Improvement of Overall Throughput

Figure: Throughput (queries/hour, axis 100 to 600) versus Number of Clients (2 to 12) for three systems: SIGMODXYZ with Request Window, SIGMODXYZ, and DBMS X.

Up to a 1.7x speedup

SLIDE 26

Outline

Motivation: To Improve DQP Throughput
Solution: Request Window
Evaluation: Distributed TPC-H Queries
Classification: Data Sharing Mechanisms
Conclusion: Summary and Future Work


SLIDE 27

Related Data Sharing Techniques

Two Correlated Factors:

– Restriction on interarrival times (deadline for sharing)
– Amount of shared data (We can share data, but how much?)

Figure: techniques placed along two axes, Restriction on Interarrival Times and Amount of Shared Data: Request Window, Table Scan Piggybacking, Multi Query Optimization (All queries must arrive simultaneously!), Page caching (No restriction!).

This is a rough comparison!

SLIDE 28

Discussions and Future Work

In a word:

Improve total throughput without sacrificing the response time of individual query execution

Request Window is suitable for running concurrent DSS queries

It is hard to estimate delay opportunities exactly
Add Window Notification Mechanism

– Monitoring query execution progress
– Notifying wrapper to issue window

SLIDE 29

Thank You