Real-Time On-line Analytical Processing (OLAP) On Multi-Core and Cloud Architectures


SLIDE 1

Frank Dehne ■ www.dehne.net

Frank Dehne

School of Computer Science Centre For Advanced Studies Canada

Real-Time On-line Analytical Processing (OLAP) On Multi-Core and Cloud Architectures

SLIDE 2

Parallel Data Analytics

Joint work with R.Bordawekar (IBM Yorktown), J.Dale (IBM Littletown), R.Grosset (IBM Toronto), M.Genkin (IBM Toronto), S.Jou (IBM Toronto), P.Jain (IBM Littletown), M.Petitclerc (IBM Laval), A.Rau-Chaplin (Dalhousie), D.Robillard (Carleton), F.Thomas (IBM Ottawa), H.Zaboli (Carleton), R.Zhou (Carleton).

SLIDE 3

Online Analytical Processing (OLAP)

IBM/COGNOS

  • Insight
  • Workspace
  • Report/Studio

SLIDE 4

Online Analytical Processing (OLAP)

SLIDE 5

Online Analytical Processing (OLAP)

A B C AB AC BC ABC

Operations:

  • roll-up
  • drill-down
  • slice
  • dice
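
The four operations can be illustrated on a tiny in-memory fact table. This is a minimal sketch, not taken from the slides: the table `FACTS`, its values, and the helper names `group_by`, `slice_`, and `dice` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical fact table: (A, B, C, measure). Values are made up.
FACTS = [
    ("a1", "b1", "c1", 10),
    ("a1", "b2", "c1", 20),
    ("a2", "b1", "c2", 30),
    ("a2", "b2", "c2", 40),
]
DIMS = ("A", "B", "C")


def group_by(facts, dims):
    """Sum the measure over every combination of the given dimensions."""
    idx = [DIMS.index(d) for d in dims]
    out = defaultdict(int)
    for row in facts:
        out[tuple(row[i] for i in idx)] += row[-1]
    return dict(out)


def slice_(facts, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return [r for r in facts if r[i] == value]


def dice(facts, allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return [r for r in facts
            if all(r[DIMS.index(d)] in vals for d, vals in allowed.items())]


ab = group_by(FACTS, ("A", "B"))   # fine-grained view AB
a = group_by(FACTS, ("A",))        # roll-up: AB -> A
# Drill-down is the reverse move: from view A back to the finer view AB.
a1_only = slice_(FACTS, "A", "a1")
sub_cube = dice(FACTS, {"A": {"a1", "a2"}, "B": {"b1"}})
```

Here roll-up and drill-down are simply group-bys at coarser or finer granularity, while slice and dice filter rows before aggregation.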

SLIDE 6

Online Analytical Processing (OLAP)

A B C AB AC BC ABC

Operations:

  • roll-up
  • drill-down
  • slice
  • dice

ABCD
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D
All

SLIDE 7

Online Analytical Processing (OLAP)

A B C AB AC BC ABC

Operations:

  • roll-up
  • drill-down
  • slice
  • dice

Traditional: Data Cube

Pre-compute group-bys to improve query response time.
Static or batch updates.

ABCD
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D
All
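
To make the pre-computation concrete, the sketch below builds every group-by of the ABCD lattice for a handful of rows. It is only an illustration under assumptions: the rows in `ROWS`, the use of a sum as the aggregate, and the function name `build_cube` are hypothetical and not taken from the slides.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("A", "B", "C", "D")
# Hypothetical rows: one value per dimension plus a numeric measure.
ROWS = [
    ("a1", "b1", "c1", "d1", 5),
    ("a1", "b2", "c1", "d2", 7),
    ("a2", "b1", "c2", "d1", 11),
]


def build_cube(rows):
    """Pre-compute the sum of the measure for every group-by in the lattice."""
    cube = {}
    for k in range(len(DIMS), -1, -1):            # ABCD first, "All" last
        for idx in combinations(range(len(DIMS)), k):
            view = defaultdict(int)
            for row in rows:
                view[tuple(row[i] for i in idx)] += row[-1]
            name = "".join(DIMS[i] for i in idx) or "All"
            cube[name] = dict(view)
    return cube


cube = build_cube(ROWS)
print(len(cube))      # 16 group-bys for 4 dimensions
print(cube["All"])    # {(): 23}
print(cube["A"])      # {('a1',): 12, ('a2',): 11}
```

With d dimensions there are 2^d group-bys, and each new row touches all of them, which is one reason a static cube is usually refreshed in batches rather than per insert.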

SLIDE 8

OLAP vs. OLTP

 | OLTP System | OLAP System
Source of data | Operational data | Consolidated data
Purpose of data | Business operations | Planning, decision support
Type of data | Snapshot of ongoing business | Multi-dimensional views of “historic” data
Updates | Small and fast | Periodic long-running batch jobs
Queries | Relatively simple, involving few data records | Often complex, involving aggregations of large data sets
Processing speed | Typically very fast | Depends on amount of data involved; batch updates and complex queries may take many hours

Source: AcceleratedAnalytics.com

SLIDE 9

The Five V's Of “Big Data”

  • Volume
  • Velocity
  • Variety
  • Veracity
  • Value

ABCD
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D
All

SLIDE 10

Real-Time OLAP

  • Avoid static data cube structure and batch updates.
  • Stream of insert and OLAP query operations.
  • Inserts are immediate.
  • OLAP queries operate on the latest, up-to-date data set.

A B C AB AC BC ABC

Insert & Query Stream Real-Time OLAP Engine Query Results
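
The stream model above can be sketched as a toy engine that answers every query from the current data instead of a pre-computed cube. The class `RealTimeEngine`, the function `run_stream`, and the sample stream are assumptions made for illustration; they do not describe the engine presented in this talk.

```python
class RealTimeEngine:
    """Toy engine: no pre-computed cube, every query scans the current data."""

    def __init__(self):
        self.rows = []

    def insert(self, dims, measure):
        self.rows.append((dims, measure))    # visible to the next query

    def query(self, predicate):
        return sum(m for d, m in self.rows if predicate(d))


def run_stream(engine, stream):
    """Apply a mixed stream of operations in arrival order."""
    results = []
    for op, *args in stream:
        if op == "insert":
            engine.insert(*args)
        else:
            results.append(engine.query(*args))
    return results


stream = [
    ("insert", {"A": "a1"}, 10),
    ("query", lambda d: d["A"] == "a1"),
    ("insert", {"A": "a1"}, 5),
    ("query", lambda d: d["A"] == "a1"),
]
results = run_stream(RealTimeEngine(), stream)   # [10, 15]
```

The second query already reflects the second insert. The linear scan in `query` is deliberately naive and shows why performance becomes the central problem once the static cube is dropped.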

SLIDE 11

Real-Time OLAP

  • Problem: Performance
  • Static data cube was introduced to improve performance...

A B C AB AC BC ABC

Insert & Query Stream Real-Time OLAP Engine Query Results

SLIDE 12

Real-Time OLAP

Research Question: Can parallel computing be used to improve performance for real-time OLAP?

A B C AB AC BC ABC

Insert & Query Stream Real-Time OLAP Engine Query Results

SLIDE 13

Parallel Computing

Shared memory: Multi-core Processor
Distributed memory: Cloud / Cluster

SLIDE 14

Real-Time OLAP on Multi-Core Processors

A B C AB AC BC ABC

Insert & Query Stream Real-Time OLAP Engine Query Results

SLIDE 15

Real-Time OLAP on Multi-Core Processors

Insert & Query Stream Real-Time OLAP Engine Query Results Parallel DC-Tree

SLIDE 16

Real-Time OLAP on Multi-Core Processors

Insert & Query Stream Real-Time OLAP Engine Query Results Parallel DC-Tree

  • Multidimensional tree data structure.
  • Operations: insert and query.
  • Enhanced for data aggregation and dimension hierarchies (Kriegel et al., ICDE 2000)
  • Enhanced for multi-core parallel computing (Dehne et al., CCGrid 2012)

SLIDE 17

Sequential DC-Tree

  • Ester, Kohlhammer, Kriegel (ICDE 2000).
  • Adaptation of R-tree for OLAP.
  • Replaces total ordering by conceptual hierarchies.
  • Replaces minimum bounding rectangles (MBR) by minimum describing sets (MDS).
  • Adds internal directory nodes.

R-Tree
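
The idea of describing a subtree by sets of hierarchy values rather than by numeric ranges can be sketched as follows. This is a simplified illustration, not the definition from the DC-tree paper: the hierarchy in `PARENT`, the helper names, and the overlap rule are assumptions.

```python
# Hypothetical concept hierarchy for one dimension: child -> parent.
PARENT = {
    "Ottawa": "Canada", "Toronto": "Canada",
    "Boston": "USA",
    "Canada": "ALL", "USA": "ALL",
}


def ancestors(value):
    """Return value followed by all of its ancestors up to the root."""
    chain = [value]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def related(u, v):
    """True if u and v are equal or one is an ancestor of the other."""
    return u in ancestors(v) or v in ancestors(u)


def mds_overlap(mds_a, mds_b):
    """Assumed rule: two MDSs overlap if every dimension has a related pair."""
    return all(
        any(related(u, v) for u in set_a for v in set_b)
        for set_a, set_b in zip(mds_a, mds_b)
    )


# One set of hierarchy values per dimension (here a single Location dimension).
node_mds = [{"Ottawa", "Toronto"}]
query_canada = [{"Canada"}]
query_usa = [{"USA"}]
```

With these inputs, `mds_overlap(node_mds, query_canada)` is true and `mds_overlap(node_mds, query_usa)` is false, so a query for USA could skip this subtree. No total order on the values is needed, only the parent relation.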

SLIDE 18

Conceptual Hierarchies

SLIDE 19

Conceptual Hierarchies

Data representation:

SLIDE 20

Minimum Describing Set (MDS)

MBR MDS

SLIDE 21

Parallel DC-Tree

parallel DC-tree multi-core processor memory

inserts/queries results

Stream of

  • Inserts
  • OLAP queries

(Dehne et al., CCGrid 2012)

SLIDE 22

Parallel DC-Tree

Parallelization:

  • Insert and OLAP query operations are executed concurrently.
  • OLAP query operations that need to search multiple subtrees of a node are split into multiple concurrent processes.

parallel DC-tree multi-core processor memory

inserts/queries results
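
Splitting a query across the subtrees of a node can be sketched with a thread pool. This is a minimal illustration under assumptions, not the parallel DC-tree itself: one-dimensional key ranges stand in for MDSs, only the root's subtrees are searched concurrently, and the names `Node`, `query_seq`, and `query_par` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    def __init__(self, lo, hi, measure, children=()):
        self.lo, self.hi = lo, hi        # key range covered by this subtree
        self.measure = measure           # pre-aggregated sum of the subtree
        self.children = list(children)


def query_seq(node, lo, hi):
    """Sequentially sum the measures of all keys inside [lo, hi]."""
    if node.hi < lo or node.lo > hi:
        return 0                         # no overlap: prune this subtree
    if lo <= node.lo and node.hi <= hi:
        return node.measure              # fully covered: use the aggregate
    return sum(query_seq(c, lo, hi) for c in node.children)


def query_par(root, lo, hi, workers=4):
    """Search the overlapping subtrees of the root concurrently."""
    if not root.children:
        return query_seq(root, lo, hi)
    hits = [c for c in root.children if not (c.hi < lo or c.lo > hi)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda c: query_seq(c, lo, hi), hits))


leaves = [Node(k, k, 10) for k in range(1, 5)]      # keys 1..4, measure 10 each
d1 = Node(1, 2, 20, leaves[:2])
d2 = Node(3, 4, 20, leaves[2:])
root = Node(1, 4, 40, [d1, d2])
total = query_par(root, 2, 4)                        # 30
```

Python threads show the task structure only. Because of the interpreter lock they would not give the multi-core speedup that a native implementation can achieve.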

SLIDE 23

Parallel DC-Tree

Main Problems:

  • Interference between concurrent insert and OLAP query operations.
  • Consistency (Strong Serialization): OLAP query results have to include transient inserts that were issued earlier.

parallel DC-tree multi-core processor memory

inserts/queries results

SLIDE 24

Parallel DC-Tree

Race Conditions:

  • Inserts and queries run at different speeds.
  • Inserts traverse root to leaf and back to root.
  • Queries need to traverse subtrees depending on data volume to be aggregated.
  • Insert and query operations can overtake each other.

parallel DC-tree multi-core processor memory

inserts/queries results

SLIDE 25

Data Structure

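Add:

  • Right Sibling Links
  • Time Stamps

[Figure: tree with root R. Each node is labeled with ID, Time Stamp, Measure and carries an MDS List. Node labels: D1 (1, 20), D2 (2, 20), L1 (3, 10), L2 (6, 10), L3 (4, 10), L4 (5, 10).]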
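
A node layout with these additions might look like the sketch below. It is an assumption-laden illustration: the field names, the global counter `_clock`, and the `split` helper are hypothetical. Giving the new sibling the old time stamp follows the slides; re-stamping the node that was split is an assumption made here.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional, Set

_clock = count(1)     # hypothetical global time stamp source


@dataclass
class Node:
    node_id: str
    time_stamp: int
    measure: float = 0.0
    mds: List[Set[str]] = field(default_factory=list)   # one value set per dimension
    children: List["Node"] = field(default_factory=list)
    right: Optional["Node"] = None                        # right sibling link


def split(node, new_id):
    """Move the second half of node's children into a new right sibling."""
    half = len(node.children) // 2
    moved = node.children[half:]
    new = Node(new_id,
               time_stamp=node.time_stamp,                # new node gets the old time stamp
               measure=sum(c.measure for c in moved),
               children=moved,
               right=node.right)
    node.children = node.children[:half]
    node.measure -= new.measure
    node.time_stamp = next(_clock)    # assumption: the split node is re-stamped
    node.right = new                  # MDS lists of both nodes would be recomputed here
    return new


d1 = Node("D1", time_stamp=next(_clock), measure=40.0,
          children=[Node(f"L{i}", next(_clock), 10.0) for i in range(1, 5)])
old_stamp = d1.time_stamp
d2 = split(d1, "D2")
```

After the split, a traversal that reached `d1` before the change can still reach the moved children through `d1.right`, and the time stamps give it a way to notice that the structure changed.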

SLIDE 26

Lengthy Case Analysis...

CASE:

  • Insert creates a directory node split.
  • Concurrent OLAP query returns up the tree and finds tree structure changed.

[Figure: tree before and after the directory node split; nodes D1, D2, D3 and the new node D4, labeled with time stamps.]

New node gets old time stamp.

SLIDE 27

Parallel DC-Tree Performance

Architecture:

  • Intel Xeon Westmere EX
  • 20 Cores (2 Sockets)
  • 40 Hardware Threads (Hyperthreading)
  • 256 GB Memory

IBM Research Labs, Toronto

SLIDE 28

Parallel DC-Tree Performance

Data:

  • Transaction Processing Performance Council Decision Support Benchmark (TPC-DS)

tpc.org

SLIDE 29

TPC-DS Benchmark

8 Dimensions Hierarchy Levels

SLIDE 30

Performance

  • 100 GB data set (10 Mil. Records)
  • 10,000 queries
  • 1,000 insertions

OLAP Query Response Time

SLIDE 31

Performance

  • 100 GB data set (10 Mil. Records)
  • 10,000 queries
  • 1,000 insertions

Throughput

SLIDE 32

Performance

[Two plots, each titled "Total"]

Response time: 5 sec. -> 0.25 sec.
Response time: 2.7 sec. -> 0.13 sec.
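
The improvement implied by these response times can be worked out directly. The slide does not state which configurations the before and after values refer to, so the sketch below only computes the ratios; the function name `speedup` is an assumption.

```python
def speedup(before_s, after_s):
    """Ratio of the slower response time to the faster one."""
    return before_s / after_s


first = speedup(5.0, 0.25)     # 20.0
second = speedup(2.7, 0.13)    # about 20.8
print(f"{first:.1f}x, {second:.1f}x")
```

Both cases correspond to roughly a twentyfold reduction in response time.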

SLIDE 33

Performance

[Two plots, each titled "Total"]

Response time: 5 sec. -> 0.25 sec.
Response time: 2.7 sec. -> 0.13 sec.

IBM CAS Research Impact Of The Year Award

SLIDE 34

Real-Time OLAP on Cloud Architectures

SLIDE 35

Real-Time OLAP on Cloud Architectures

A B C AB AC BC ABC

Insert & OLAP Query Stream Real-Time OLAP Engine OLAP Query Results

SLIDE 36

Cloud Computing Architecture

  • Large scale compute cluster
  • Virtual machines on demand
  • Elastic: Dynamic addition of compute resources
  • Dedicated storage devices (e.g. S3 buckets)

SLIDE 37

Velocity OLAP (vOLAP) System Architecture

SLIDE 38

Velocity OLAP (vOLAP) System Architecture

[Diagram labels: Client C; Insert; Server; Image I_k; Worker; Subset D_i; Zookeeper: Global Image & Sync]

SLIDE 39

Velocity OLAP (vOLAP) System Architecture

[Diagram labels: Client C; OLAP Query; Server; Image I_k; Workers; Subsets]

SLIDE 40

Subset Data Structure

SLIDE 41

Load Balancing

SLIDE 42

Insert/Query Stream Serialization

  • Strong serialization of insert and OLAP query operations within each session.
  • Strong serialization of insert and OLAP query operations within sessions attached to the same server (workgroup).
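
One simple way to obtain this ordering guarantee is to push all operations of a workgroup through a single first-in, first-out queue. The slides do not say how vOLAP implements it, so the sketch below is a stand-in under assumptions: the `Server` class, its methods, and the running total used as the "query result" are hypothetical.

```python
import queue
import threading


class Server:
    """Toy server: one FIFO queue serializes all sessions attached to it."""

    def __init__(self):
        self.ops = queue.Queue()
        self.total = 0
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        # Operations are applied strictly in the order they were enqueued.
        while True:
            op, value, reply = self.ops.get()
            if op == "stop":
                break
            if op == "insert":
                self.total += value
            else:                          # "query"
                reply.put(self.total)

    def insert(self, value):
        self.ops.put(("insert", value, None))

    def query(self):
        reply = queue.Queue(maxsize=1)
        self.ops.put(("query", None, reply))
        return reply.get()                 # blocks until the query is processed

    def stop(self):
        self.ops.put(("stop", None, None))
        self.thread.join()


server = Server()
server.insert(10)
server.insert(5)
answer = server.query()     # 15: both earlier inserts are included
server.stop()
```

Because the query is enqueued after both inserts, it cannot be answered before they are applied, which is the property strong serialization asks for within a workgroup.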

SLIDE 43

Between Servers: Probabilistic Serialization

SLIDE 44

Between Servers: Probabilistic Serialization

  • n = 1 billion
  • 50% coverage (500 million reported data items)
  • 1 second elapsed time:
  • approx. 1% probability of 2 missing data items (0.0000004% of the result)
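
The percentage quoted above follows from the other figures on the slide. The short check below recomputes it; the variable names are assumptions.

```python
n = 1_000_000_000
coverage = 0.5
reported = int(n * coverage)               # 500,000,000 reported data items
missing = 2
fraction_percent = missing / reported * 100
print(f"{fraction_percent:.7f}%")          # 0.0000004%
```

Two missing items out of 500 million reported items is 4 parts in a billion, which matches the 0.0000004% stated on the slide.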

SLIDE 45

vOLAP Performance

Architecture:

  • Amazon EC2
  • servers: c3.8xlarge
  • workers: c3.4xlarge
  • clients / manager / Zookeeper: c3.2xlarge
  • Linux 3.14.35, ZeroMQ 4.0.5, Zookeeper 3.4.6.

SLIDE 46

vOLAP Performance

8 Dimensions Hierarchy Levels

Data: TPC-DS

SLIDE 47

Horizontal Scale-Up Performance

SLIDE 48

Impact of Workload Mix

SLIDE 49

Impact of Query Coverage

SLIDE 50

Conclusion

Parallel data structures can enable real-time OLAP on multi-core and cloud architectures.

SLIDE 51

Publications

  • "vOLAP: A Scalable Cloud-Based System for Real-Time OLAP on High-Velocity Data", submitted.
  • "Scalable real-time OLAP on cloud architectures", Journal of Parallel and Distributed Computing (JPDC), Vol. 79-80, pp. 31-41, 2015.
  • "A distributed tree data structure for real-time OLAP on cloud architectures", in Proc. IEEE Int. Conference on Big Data (IEEE BigData 2013), pp. 499-505, 2013.
  • "Parallel real-time OLAP on multi-core processors", in Proc. 12th IEEE/ACM Int. Symp. on Cluster, Cloud and Grid Computing (CCGrid 2012), pp. 588-594, 2012.