Toward GPU Accelerated Data Stream Processing
Marcus Pinnecke, David Broneske and Gunter Saake University of Magdeburg, Germany May 27, 2015
Background and Motivation
Fundamentals, Windowing, GPU Acceleration in DBMS/SPS
Examples
■ System Monitoring and Fraud Prevention — log files about load, network activity, storage
■ Social Media — identify topics of interest online, such as top-k hashtags on Twitter
■ …

Requirements
■ Real-time response
■ Continuous processing and analysis
■ High-volume data, potentially infinite
■ High-velocity data (many changes)
Infinite streams of data, but…
■ limited main memory and
■ only sequential access

Solutions
■ Reduction of the data amount (e.g., sampling) or
■ Buffering (windowing)
[Figure: a count-based window operator turns an infinite stream of events into a stream of finite windows]
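The count-based windowing shown above can be expressed as a short generator. A minimal Python sketch with illustrative names, not the authors' implementation:

```python
# Minimal sketch of a count-based (tumbling) window operator: it turns
# a potentially infinite stream of events into a stream of finite
# windows of a fixed length. Names are illustrative assumptions.

def count_based_windows(events, length):
    """Yield consecutive windows of `length` events each."""
    window = []
    for e in events:
        window.append(e)
        if len(window) == length:
            yield window    # emit a finite window downstream
            window = []     # start collecting the next one
```

Because the operator only ever buffers `length` events, it satisfies the limited-memory and sequential-access constraints above.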
■ Number of join candidates depends on the number of events inside the window
■ Many events in the same instant for time-based windows
■ Decrease of throughput
Data flow systems (e.g., stream processing systems) suffer from back pressure.

Back pressure
■ Upwards-propagated decrease of throughput
■ Down to the level of the slowest component

This results in a need for load shedding.
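How back pressure propagates upwards can be made concrete with bounded queues: a slow operator's full input queue blocks its producer, which in turn throttles the whole pipeline to the slowest component's rate. The pipeline, names, and queue sizes below are illustrative assumptions, not from the talk:

```python
import queue
import threading
import time

# Hypothetical three-stage pipeline: a fast source, a slow operator,
# and a sink, connected by bounded queues. When the slow operator
# cannot keep up, its input queue fills and put() blocks, throttling
# the source to the slow operator's rate -- back pressure.

def run_pipeline(n_events=50, slow_delay=0.001):
    q1 = queue.Queue(maxsize=4)   # source -> slow operator
    q2 = queue.Queue(maxsize=4)   # slow operator -> sink
    out = []

    def source():
        for i in range(n_events):
            q1.put(i)             # blocks while q1 is full
        q1.put(None)              # end-of-stream marker

    def slow_operator():
        while (e := q1.get()) is not None:
            time.sleep(slow_delay)  # simulate expensive work
            q2.put(e)
        q2.put(None)

    def sink():
        while (e := q2.get()) is not None:
            out.append(e)

    threads = [threading.Thread(target=f)
               for f in (source, slow_operator, sink)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

The source finishes no faster than the slow operator drains q1; without load shedding (or more resources for the slow operator), the whole pipeline runs at that rate.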
[Figure: the pipeline's throughput degrades to the throughput of its slowest component]
■ Parallelization of operators
■ Distributed computation
[Figure: operators A, B, C replicated and distributed across Site 1 and Site 2 for more computation resources]
GPUs in DBMS
■ … are efficient co-processors
■ … might outperform CPUs for certain operations
■ … run highly parallel computations (SIMD)
■ … have a huge corpus of research results
Some conclusions
■ Data transfer costs to and from the graphics card are critical
■ Operations should match the GPU architecture (e.g., branch-free)
■ An operation must be expensive enough to amortize the transfer costs
■ Column-oriented architectures save transfer costs
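The amortization rule can be checked with back-of-the-envelope arithmetic: offloading pays off only when the GPU's compute saving exceeds the round-trip transfer time over PCIe. The function and the bandwidth figure below are illustrative assumptions:

```python
# Back-of-the-envelope check of the amortization rule: offloading an
# operation to the GPU only pays off when GPU compute time plus the
# round-trip PCIe transfer time beats the CPU time. The bandwidth
# value is an illustrative assumption (~12 GB/s effective PCIe 3.0 x16).

def offload_pays_off(n_bytes, cpu_time_s, gpu_time_s, pcie_bw=12e9):
    transfer = 2 * n_bytes / pcie_bw   # to and from the graphics card
    return gpu_time_s + transfer < cpu_time_s
```

For example, with 1 GB of data the round trip costs roughly 0.17 s, so a 10x GPU speedup on a 1 s CPU operation is worthwhile, while the same speedup on a 0.2 s operation is not. This is also why column-oriented layouts help: shipping only the needed columns shrinks `n_bytes`.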
Our 1st contribution: handling the graphics card memory limitation for very large windows via bucketing
Partitioning a stream of variable-length windows of tuples into a stream of "buckets"

Bucket: a fixed-size window portion with column-oriented event representation
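A minimal Python sketch of this definition; the representation and names are illustrative, not the paper's implementation:

```python
# Sketch of bucketing: a variable-length window of row-wise tuples is
# split into fixed-size portions ("buckets") of at most k events each,
# and each portion is flipped into a column-oriented layout, so memory
# allocation on the graphics card has a known upper bound.

def bucketize(window, k):
    """Split a window (list of equal-length tuples) into buckets of at
    most k events, one list per attribute (column-oriented)."""
    for i in range(0, len(window), k):
        portion = window[i:i + k]
        # transpose rows into columns
        yield [list(col) for col in zip(*portion)]
```

A window of five events with two attributes and bucket size k = 3 thus becomes two buckets: one holding columns of three values, one holding the remaining two.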
[Figure: walkthrough of the bucketing operator for bucket sizes 3 and 5 — incoming events are collected and emitted as column-oriented buckets of 3 and 5 events, respectively]
We suggest a technique called bucketing that partitions each stream of variable-length windows of tuples (events) into a stream of fixed-size window portions with column-oriented event representation (buckets).
■ Each operator requests its own bucket size k
■ The bucket size is independent of the actual window length
■ Memory allocation on the graphics card has an upper bound for the input
■ Bucketing flips the event representation
■ Processing operates on entire columns
■ If the window length exceeds the bucket size, the window is split into portions
■ A single bucketing operator can be subscribed to by many operators
                       Windowing                    Bucketing
Purpose                Bounds the infinite stream   Partitions windows
Consumes               Stream of events             Stream of windows
Produces               Stream of windows            Stream of buckets
#Events                Might be huge                Has an upper bound
Event representation   Tuples                       Column-wise
[Figure: one ring buffer per attribute of the stream schema (length n); slice subscribers 1–3 each hold their own view of the most recent events]
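The per-attribute ring buffer with subscriber slices could be sketched as follows; the class and method names are illustrative assumptions, not the paper's API:

```python
# Sketch of a per-attribute ring buffer: the stream schema gets one
# fixed-capacity ring buffer per attribute, and each slice subscriber
# reads its own contiguous view of the most recent events. Names are
# illustrative assumptions.

class AttributeRingBuffer:
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.count = 0            # total values ever appended

    def append(self, value):
        """Overwrite the oldest slot once the buffer is full."""
        self.buf[self.count % self.capacity] = value
        self.count += 1

    def slice(self, length):
        """Return the `length` most recent values, oldest first.

        Each subscriber calls this with its own window length, so many
        operators can share one buffer without copying the stream.
        """
        length = min(length, self.count, self.capacity)
        start = self.count - length
        return [self.buf[i % self.capacity]
                for i in range(start, self.count)]
```

A stream of schema length n would hold n such buffers, one per attribute, which keeps each attribute contiguous and thus cheap to ship to the graphics card column by column.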
14
Toward GPU Accelerated Data Stream Processing
Slice subscriber 1 Slice subscriber 2 Slice subscriber 3
Ring Buffer 1 Ring Buffer 2 Ring Buffer n
Actual View
Stream Schema Length n
1 1 1
14
Toward GPU Accelerated Data Stream Processing
Slice subscriber 1 Slice subscriber 2 Slice subscriber 3
Ring Buffer 1 Ring Buffer 2 Ring Buffer n
Actual View
Stream Schema Length n
1 1 1
14
Toward GPU Accelerated Data Stream Processing
Slice subscriber 1 Slice subscriber 2 Slice subscriber 3
Ring Buffer 1 Ring Buffer 2 Ring Buffer n
Actual View
Stream Schema Length n
1 1 1
2 2 2
14
Toward GPU Accelerated Data Stream Processing
Slice subscriber 1 Slice subscriber 2 Slice subscriber 3
Ring Buffer 1 Ring Buffer 2 Ring Buffer n
Actual View
Stream Schema Length n
1 1 1
2 2 2
14
Toward GPU Accelerated Data Stream Processing
Slice subscriber 1 Slice subscriber 2 Slice subscriber 3
Ring Buffer 1 Ring Buffer 2 Ring Buffer n
Actual View
Stream Schema Length n
1 1 1
2 2 2
3 3 3
14
Toward GPU Accelerated Data Stream Processing
Slice subscriber 1 Slice subscriber 2 Slice subscriber 3
Ring Buffer 1 Ring Buffer 2 Ring Buffer n
Actual View
Stream Schema Length n
1 1 1
2 2 2
3 3 3
14
Toward GPU Accelerated Data Stream Processing
Slice subscriber 1 Slice subscriber 2 Slice subscriber 3
Ring Buffer 1 Ring Buffer 2 Ring Buffer n
Actual View
Stream Schema Length n
1 1 1
2 2 2
3 3 3
4 4 4
14
Toward GPU Accelerated Data Stream Processing
Slice subscriber 1 Slice subscriber 2 Slice subscriber 3
Ring Buffer 1 Ring Buffer 2 Ring Buffer n
Actual View
Stream Schema Length n
1 1 1
2 2 2
3 3 3
4 4 4
14
Toward GPU Accelerated Data Stream Processing
Slice subscriber 1 Slice subscriber 2 Slice subscriber 3
Ring Buffer 1 Ring Buffer 2 Ring Buffer n
Actual View
Stream Schema Length n
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
14
Toward GPU Accelerated Data Stream Processing
Slice subscriber 1 Slice subscriber 2 Slice subscriber 3
Ring Buffer 1 Ring Buffer 2 Ring Buffer n
Actual View
Stream Schema Length n
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
14
Toward GPU Accelerated Data Stream Processing
Slice subscriber 1 Slice subscriber 2 Slice subscriber 3
Ring Buffer 1 Ring Buffer 2 Ring Buffer n
Actual View
Stream Schema Length n
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
14
Toward GPU Accelerated Data Stream Processing
Slice subscriber 1 Slice subscriber 2 Slice subscriber 3
Ring Buffer 1 Ring Buffer 2 Ring Buffer n
Actual View
Stream Schema Length n
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
14
Toward GPU Accelerated Data Stream Processing
Slice subscriber 1 Slice subscriber 2 Slice subscriber 3
Ring Buffer 1 Ring Buffer 2 Ring Buffer n
Actual View
Stream Schema Length n
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
14
Toward GPU Accelerated Data Stream Processing
Slice subscriber 1 Slice subscriber 2 Slice subscriber 3
Ring Buffer 1 Ring Buffer 2 Ring Buffer n
Actual View
Stream Schema Length n
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
14
Toward GPU Accelerated Data Stream Processing
Slice subscriber 1 Slice subscriber 2 Slice subscriber 3
Ring Buffer 1 Ring Buffer 2 Ring Buffer n
Actual View
Stream Schema Length n
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
14
Toward GPU Accelerated Data Stream Processing
Slice subscriber 1 Slice subscriber 2 Slice subscriber 3
Ring Buffer 1 Ring Buffer 2 Ring Buffer n
Actual View
Stream Schema Length n
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
14
Toward GPU Accelerated Data Stream Processing
Slice subscriber 1 Slice subscriber 2 Slice subscriber 3
Ring Buffer 1 Ring Buffer 2 Ring Buffer n
Actual View
Stream Schema Length n
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
14
Our 2nd contribution: identification of research challenges related to co-processing for data stream processing
■ Other specialized co-processors are possible
  ■ Intel Xeon Phi or FPGAs, for instance
  ■ Optimized algorithms and execution models for the particular co-processor
■ More than CPU-only data stream processing:
  ■ Large physical query execution plan space
  ■ Find the best performance for a logical plan and the load sharing between devices

Further research is needed to find the limitations and benefits of applying modern hardware here.
■ Memory allocation has an upper bound for the input (fixed size)
■ Reduces transfer costs (column selection)
■ Separate operator, independent of the SPS's tuple-at-a-time or batch-at-a-time support
■ Ring buffer per attribute plus a per-subscriber slice
■ Enables processing of large-scale windows with limited graphics card memory
■ No fallback to the CPU required
■ Other co-processors with specialized algorithms — limitations and benefits
■ Large search space for query plans (logical operator — device — concrete algorithm) — optimizer