Toward GPU Accelerated Data Stream Processing (PowerPoint presentation)
Marcus Pinnecke, David Broneske and Gunter Saake, University of Magdeburg, Germany, May 27, 2015


SLIDE 1

Toward GPU Accelerated Data Stream Processing

Marcus Pinnecke, David Broneske and Gunter Saake
University of Magdeburg, Germany
May 27, 2015

SLIDE 2

Background and Motivation

Fundamentals, Windowing, GPU Acceleration in DBMS/SPS

SLIDE 3

Data Stream Processing
Application requirements

Examples
■ System monitoring and fraud prevention — log files about load, network activity, storage
■ Social media — identify topics of interest online, such as top-k hash tags on Twitter
■ …

Requirements
■ Real-time response
■ Continuous processing and analysis
■ High-volume data, potentially infinite
■ High-velocity data (many changes)

SLIDE 4

Data Stream Processing
Processing Model and Windowing

Infinite streams of data, but…
■ Limited main memory and
■ Only sequential access

Solutions
■ Reduction of the data amount (e.g., sampling), or
■ Buffering (windowing)
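The first solution named above, reducing the data amount by sampling, can be sketched with classic reservoir sampling, which keeps a bounded, uniform sample of an unbounded stream. This is a minimal illustration, not the technique used in the paper; all names are ours.

```python
import random

def reservoir_sample(stream, k, rng=random.Random(42)):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, event in enumerate(stream):
        if i < k:
            sample.append(event)
        else:
            # Replace an existing element with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = event
    return sample

print(reservoir_sample(range(1000), 5))  # five events, uniformly drawn from the stream
```

The memory footprint stays at k events no matter how long the stream runs, which is exactly the property a stream processor needs.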

SLIDE 5

Data Stream Processing
Processing Model and Windowing: Windows

[Diagram] Windowing turns an infinite stream of events into a stream of finite windows.

■ Count-based windows
■ Time-based windows
  • More common for real applications
  • Variable number of events per window
  • Problematic due to limited GPU memory
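The two window kinds above can be sketched as simple tumbling windows; the point of the example is that time-based windows hold a variable number of events, which is what makes them problematic for fixed GPU memory. Function names and data are illustrative, not from the paper.

```python
def count_windows(events, size):
    """Count-based tumbling windows: a fixed number of events per window."""
    return [events[i:i + size] for i in range(0, len(events), size)]

def time_windows(events, width):
    """Time-based tumbling windows over (timestamp, value) events:
    a variable number of events per window."""
    windows = {}
    for ts, val in events:
        windows.setdefault(ts // width, []).append((ts, val))
    return [windows[k] for k in sorted(windows)]

events = [(0, 'a'), (1, 'b'), (1, 'c'), (1, 'd'), (5, 'e')]
print(count_windows([v for _, v in events], 2))       # → [['a', 'b'], ['c', 'd'], ['e']]
print([len(w) for w in time_windows(events, 2)])      # → [4, 1]: window sizes vary
```

Four events fall into the same 2-unit time window here; with many events per instant, a time-based window can grow arbitrarily large.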

SLIDE 6

Data Stream Processing
Bottleneck — Example Join Algorithm

■ Number of join candidates depends on the number of events inside the window

SLIDE 7

Data Stream Processing
Bottleneck — Example Join Algorithm

■ Number of join candidates depends on the number of events inside the window
■ Many events in the same instant for time-based windows
■ Decrease of throughput
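A naive nested-loop window join makes the candidate-count observation concrete: every pair of events from the two windows is a candidate, so the work grows with the product of the window sizes. This sketch is our own illustration, not the join algorithm from the paper.

```python
def window_join(w1, w2, key1, key2):
    """Nested-loop join of two windows. Every pair is a join candidate,
    so the work grows with len(w1) * len(w2)."""
    candidates = 0
    matches = []
    for e1 in w1:
        for e2 in w2:
            candidates += 1
            if key1(e1) == key2(e2):
                matches.append((e1, e2))
    return matches, candidates

w1 = [(1, 'x'), (2, 'y'), (3, 'z')]
w2 = [(2, 'p'), (3, 'q'), (4, 'r'), (2, 's')]
matches, candidates = window_join(w1, w2, key1=lambda e: e[0], key2=lambda e: e[0])
print(candidates)    # 12 candidates = 3 * 4
print(len(matches))  # 3 matching pairs
```

Doubling the events per window quadruples the candidates, which is why bursts of events in one instant (time-based windows) hurt throughput so badly.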

SLIDE 8

Data Stream Processing
Bottleneck — Back Pressure

Data flow systems (e.g., stream processing systems) suffer from back pressure.

Back pressure
■ Upwards-propagated decrease of throughput
■ Down to the level of the slowest component

The result is a need for load shedding.
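The effect of back pressure can be stated in one line: once the decrease has propagated upstream, the whole pipeline runs at the rate of its slowest operator. The rates below are invented for illustration.

```python
def pipeline_throughput(stage_rates):
    """With back pressure fully propagated, pipeline throughput equals
    the throughput of the slowest stage."""
    return min(stage_rates)

# Hypothetical rates in events/second for a plan: selection, selection, join.
rates = {'selection_1': 50_000, 'selection_2': 40_000, 'join': 8_000}
print(pipeline_throughput(rates.values()))  # 8000: the join caps the whole pipeline
```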

SLIDE 9

Data Stream Processing
Bottleneck

[Diagram] Query plan σ → σ → ⨝; the overall throughput is capped by the slowest component.


SLIDE 15

Data Stream Processing
Bottleneck — Solutions

■ Parallelization of operators
■ Distributed computation

[Diagram] Operators A, B, C replicated and distributed across Site 1 and Site 2: more computation resources.

SLIDE 16

[Diagram] In DBMS? Operators A, B, C distributed across Site 1 and Site 2; analogously, distributed between CPU and GPU?

SLIDE 17

Database Management Systems
GPUs in DBMS

■ … efficient co-processor
■ … might outperform CPUs for certain operations
■ … computations are highly parallel (SIMD)
■ … huge corpus of research results

Some conclusions
■ Data transfer costs to and from the graphics card are critical
■ Operations should match the GPU architecture (e.g., branch-free)
■ An operation must be expensive enough to amortize transfer costs
■ Column-oriented architectures save transfer costs
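The amortization conclusion above can be put into a rough break-even model: offloading pays off only when transfer time plus GPU compute time beats CPU compute time. All numbers below are illustrative assumptions, not measurements from the paper.

```python
def gpu_pays_off(bytes_moved, bandwidth_bytes_per_s, t_gpu_compute, t_cpu_compute):
    """Rough model: offloading wins only if transfer plus GPU compute time
    is less than CPU compute time."""
    t_transfer = bytes_moved / bandwidth_bytes_per_s
    return t_transfer + t_gpu_compute < t_cpu_compute

# Hypothetical: 64 MiB over PCIe at 8 GB/s (~8 ms transfer), CPU takes 50 ms.
print(gpu_pays_off(64 * 2**20, 8e9, 0.005, 0.050))  # True: compute is expensive enough
print(gpu_pays_off(64 * 2**20, 8e9, 0.045, 0.050))  # False: too cheap to amortize transfer
```

The same model also explains the column-orientation conclusion: shipping only the columns an operator touches shrinks `bytes_moved` and moves the break-even point in the GPU's favor.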

SLIDE 18

GPU Acceleration for Data Stream Processing
Challenges

■ Limited memory on graphics cards vs. (time-based) windows that can be huge
■ The event representation (tuples) does not match the GPU architecture

SLIDE 19

GPU-ready Stream Processing

Our 1st contribution: handle the graphics card memory limitation for very large windows via bucketing

SLIDE 20

GPU-ready Stream Processing
Bucketing

Partitioning a stream of variable-length windows of tuples into a stream of "buckets".

We suggest
Bucket: a fixed-size window portion with column-oriented event representation
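The definition above can be sketched in a few lines: split a variable-length window of row tuples into fixed-size portions and flip each portion into one list per attribute. This is a minimal sketch of the idea, not the authors' implementation; the function name is ours.

```python
def bucketize(window, bucket_size):
    """Split a variable-length window of row tuples into fixed-size,
    column-oriented buckets (the last bucket may be smaller)."""
    buckets = []
    for i in range(0, len(window), bucket_size):
        rows = window[i:i + bucket_size]
        # Flip row-wise tuples into one list ("column") per attribute.
        columns = [list(col) for col in zip(*rows)]
        buckets.append(columns)
    return buckets

window = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]
print(bucketize(window, 3))
# → [[[1, 2, 3], ['a', 'b', 'c']], [[4, 5], ['d', 'e']]]
```

However large the window grows, each bucket holds at most `bucket_size` events, so GPU memory allocation per transfer is bounded, and the column layout matches what a column-oriented GPU operator consumes.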

SLIDES 21-28

GPU-ready Stream Processing
Bucketing (2)

[Animation] A stream of events (1, 2, 3, …) flows into the bucketing operator. Two subscribers operate bucket-at-a-time with different bucket sizes: one with bucket size 3, one with bucket size 5. Whenever enough events have arrived, the operator emits the next bucket (3 events or 5 events, respectively) in column-oriented representation.

SLIDE 29

GPU-ready Stream Processing
Benefits through Bucketing

We suggest a technique called bucketing that partitions each stream of variable-length windows of tuples (events) into a stream of fixed-size window portions with column-oriented event representation (buckets).

■ Each operator requests its own bucket size k
■ The bucket size is independent of the actual window length
■ Memory allocation on the graphics card has an upper bound for input
■ Bucketing flips the event representation, enabling processing of entire columns
■ If the window length exceeds the bucket size, the window is split into portions
■ A single bucketing operator can be subscribed to by many operators

SLIDE 30

GPU-ready Stream Processing
Buckets versus Windows

                      Windowing                    Bucketing
Purpose               Bounding an infinite stream  Partitioning windows
Consumes              Stream of events             Stream of windows
Produces              Stream of windows            Stream of buckets
#Events               Might be huge                Has upper bound
Event representation  Tuples                       Column-wise

SLIDES 31-48

GPU-ready Stream Processing
Achieve bucketing

[Animation] For a stream schema of length n (attributes a, b, c, …), the bucketing operator keeps one ring buffer per attribute (Ring Buffer 1 … Ring Buffer n). Each arriving event (a, b, c) is split column-wise and its values are appended to the corresponding ring buffers. Slice subscribers (Slice subscriber 1, 2, 3) each hold an actual view on the buffers and read fixed-size column slices as their buckets. Once a buffer is full, the oldest entries are overwritten (events 1 … 9 in the animation).
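The ring-buffer-per-attribute construction above can be sketched as follows. The class and method names are ours, and the slice subscriber is reduced to a single read for brevity; this is an illustration of the scheme, not the authors' implementation.

```python
class ColumnRingBuffer:
    """Fixed-capacity ring buffer for one attribute; the oldest values
    are overwritten once the buffer is full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = [None] * capacity
        self.count = 0  # total values ever appended

    def append(self, value):
        self.data[self.count % self.capacity] = value
        self.count += 1

    def last(self, k):
        """Return the most recent k values in arrival order (a 'slice')."""
        k = min(k, self.count, self.capacity)
        start = self.count - k
        return [self.data[(start + i) % self.capacity] for i in range(k)]

# One ring buffer per attribute of the stream schema (a, b, c).
buffers = {attr: ColumnRingBuffer(capacity=4) for attr in ('a', 'b', 'c')}
for event in [(1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4), (5, 5, 5)]:
    for attr, value in zip(('a', 'b', 'c'), event):
        buffers[attr].append(value)

# A slice subscriber with bucket size 3 reads its bucket column-wise.
print([buffers[attr].last(3) for attr in ('a', 'b', 'c')])
# → [[3, 4, 5], [3, 4, 5], [3, 4, 5]]
```

Each subscriber just reads a fixed-size slice from each attribute buffer, so the bucket comes out column-oriented for free, and old events are reclaimed without any extra bookkeeping.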

SLIDE 49

Open Research Challenges

Our 2nd contribution: identification of research challenges related to co-processing for data stream processing

SLIDE 50

Open Research Challenges
Modern hardware and scheduling in Stream Processing

■ Other specialized co-processors might be possible
  • Intel Xeon Phi or FPGAs, for instance
  • Optimized algorithms and execution models for each co-processor
■ Beyond CPU-only data stream processing:
  • Large physical query execution plan space
  • Find the best performance for a logical plan and for load sharing between devices

Further research is needed to find the limitations and benefits of applying modern hardware here.

SLIDE 51

Conclusion

SLIDE 52

Conclusion

Bucketing windows enables GPU-ready data stream processing for very large windows.

We suggest a technique called bucketing that partitions each stream of variable-length windows of tuples (events) into a stream of fixed-size window portions with column-oriented event representation (buckets).

■ Memory allocation has an upper bound for input (fixed size)
■ Reduces transfer costs (column selection)

We present an approach to achieve bucketing.

■ Separate operator, independent of the SPS's tuple-at-a-time or batch-at-a-time support
■ One ring buffer per attribute plus a per-subscriber slice
■ Enables processing of large-scale windows on limited graphics card memory
■ No fallback to CPU required

We identify research challenges for further co-processing in this context.

■ Other co-processors with specialized algorithms: limitations and benefits
■ Large search space for query plans (logical operator, device, concrete algorithm): optimizer