Column-Stores vs. Row-Stores: How Different Are They Really? Daniel - - PowerPoint PPT Presentation

▶

Aug 16, 2023 226 likes •498 views

Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale), Samuel Madden (MIT), Nabil Hachem (AvantGarde Consulting) June 12 th , 2008 Daniel Abadi -- Yale University Row vs. Column-Stores Row-Store Column-Store Last

SLIDE 1

Daniel Abadi -- Yale University

Column-Stores vs. Row-Stores: How Different Are They Really?

Daniel Abadi (Yale), Samuel Madden (MIT), Nabil Hachem (AvantGarde Consulting) June 12th, 2008

SLIDE 2

Daniel Abadi -- Yale University

Row vs. Column-Stores

Street Address Phone # E-mail First Name Last Name Last Name First Name E-mail Phone # Street Address

Row-Store Column-Store

− Might read in

unnecessary data

+ Only need to read

in relevant data

+ Easy to add a new

record

− Tuple writes might require

multiple seeks

SLIDE 3

Daniel Abadi -- Yale University

Column-Stores

Really good for read-mostly data

warehouses

Lot’s of column scans and aggregations Writes tend to be in batch [CK85], [SAB+05], [ZBN+05], [HLA+06], [SBC+07] all verify this Top 3 in TPC-H rankings (Exasol, ParAccel, and Kickfire) are column-stores

Factor of 5 faster on performance Factor of 2 superior on price/performance

SLIDE 4

Daniel Abadi -- Yale University

Data Warehouse DBMS Software

$4.5 billion industry (out of total $16 billion

DBMS software industry)

Growing 10% annually

SLIDE 5

Daniel Abadi -- Yale University

Momentum

Right solution for growing market $$$$
Vertica, ParAccel, Kickfire, Calpont,

Infobright, and Exasol new entrants

Sybase IQ’s profits rapidly increasing
Yahoo’s world largest (multi-petabyte)

data warehouse is a column-store (from Mahat Technologies acquisition)

SLIDE 6

Daniel Abadi -- Yale University

Paper Looks At Key Question

How much of the buzz around column-

stores just marketing hype?

Do you really need to buy Sybase IQ or Vertica? How far will your current row-store take you?

Can you get column-store performance from a row- store? Can you simulate a column-store in a row-store?

SLIDE 7

Daniel Abadi -- Yale University

Paper Methodology

Comparing row-store vs. column-store is

dangerous/borderline meaningless

Instead, compare row-store vs. row-store

and column-store vs. column-store

Simulate a column-store inside of a row-store Remove column-oriented features from column-store until it behaves like a row-store

SLIDE 8

Daniel Abadi -- Yale University

Simulate Column-Store Inside Row-Store

Street Address Phone # E-mail First Name Last Name Last Name First Name E-mail

1 2 3 1 2 3 1 2 3

Option A: Vertical Partitioning … Option B: Index Every Column

Last Name Index First Name Index

SLIDE 9

Daniel Abadi -- Yale University

Experiments

Star Schema Benchmark (SSBM)

Fact table contains 17 columns and 60,000,000 rows 4 dimension tables, biggest one has 80,000 rows Queries perform 2-4 joins between fact table and dimension tables, aggregate 1-2 columns from fact table [OOC06]

Implemented by professional DBA

Original row-store plus 2 column-store simulations on same row-store product

SLIDE 10

Daniel Abadi -- Yale University

SSBM Averages

SLIDE 11

Daniel Abadi -- Yale University

What’s Going On?

Vertically Partitioned Case

Tuple Sizes Horizontal Partitioning

All Indexes Case

Tuple Reconstruction

SLIDE 12

Daniel Abadi -- Yale University

Tuple Size

1 2 3

Column Data TID

1 2 3

TID Column Data

1 2 3

TID Column Data Tuple Header

Queries touch 3-4 foreign keys in fact table, 1-2 numeric

columns

Complete fact table takes up ~4 GB (compressed)
Vertically partitioned tables take up 0.7-1.1 GB

(compressed)

SLIDE 13

Daniel Abadi -- Yale University

Horizontal Partitioning

Fact table horizontally partitioned on year

Year is an element of the ‘Date’ dimension table Most queries in SSBM have a predicate on year Since vertically partitioned tables do not contain the ‘Date’ foreign key, row-store could not similarly partition them

SLIDE 14

Daniel Abadi -- Yale University

What’s Going On?

Vertically Partitioned Case

Tuple Sizes Horizontal Partitioning

All Indexes Case

Tuple Construction

SLIDE 15

Daniel Abadi -- Yale University

Tuple Construction

Common type of query:
SELECT store_name, SUM(revenue)

FROM Facts, Stores WHERE fact.store_id = stores.store_id AND stores.country = “Canada” GROUP BY store_name

SLIDE 16

Daniel Abadi -- Yale University

Tuple Construction

Result of lower part of query plan is a set
f TIDs that passed all predicates
Need to extract SELECT attributes at

these TIDs

BUT: index maps value to TID You really want to map TID to value (i.e., a vertical partition) Tuple construction is SLOW

SLIDE 17

Daniel Abadi -- Yale University

So….

All indexes approach is a poor way to

simulate a column-store

Problems with vertical partitioning are

NOT fundamental

Store tuple header in a separate partition Allow virtual TIDs Allow HP using a foreign key on a different VP

So can row-stores simulate column-

stores?

SLIDE 18

Daniel Abadi -- Yale University

Row-Store vs. Column-Store

'
Average

25.7 1 1 .7 4.4 Row-Store Row-Store (M V) C-Store

SLIDE 19

Daniel Abadi -- Yale University

Row-Store vs. Column-Store

'
Average

25.7 1 1 .7 4.4 Row-Store Row-Store (M V) C-Store

SLIDE 20

Daniel Abadi -- Yale University

Column-Store Experiments

Start with column-store (C-Store)
Remove column-store-specific

performance optimizations

End with column-store with a row-oriented

query executer

SLIDE 21

Daniel Abadi -- Yale University

Compression

Higher data value locality

in column-stores

Better ratio reduced I/O

Can use schemes like

run-length encoding

Easy to operate on directly for improved performance ([AMF06])

Q1 Q1 Q1 Q1 Q1 Q1 Q1 Q2 Q2 Q2 Q2 … … Quarter

(Q1, 1, 300)

Quarter

(Q2, 301, 350) (Q3, 651, 500) (Q4, 1151, 600)

SLIDE 22

Daniel Abadi -- Yale University

Early Materialization: create

rows first. But:

Poor memory bandwidth utilization Lose opportunity for vectorized operation

2 1 3 1 2 3 3 3 7 13 42 80 Construct 2 3 3 3 7 13 42 80 Select + Aggregate 2 1 3 1 4 4 4 4

prodID storeIDcustID price

QUERY: SELECT custID,SUM(price) FROM table WHERE (prodID = 4) AND (storeID = 1) AND GROUP BY custID

Early vs. Late Materialization

4 4 4 4

SLIDE 23

Daniel Abadi -- Yale University

Other Column-Store Optimizations

Invisible join

Column-store specific join Optimizations for star schemas Similar to a semi-join

Block Processing

SLIDE 24

Daniel Abadi -- Yale University

Simplified Version of Results

(

Average

4.4 14.9 40.7 Original C-St ore C-Store, No Compression C-St ore, Early Mat erializat ion

SLIDE 25

Daniel Abadi -- Yale University

Conclusion

Might be possible to simulate a row-store

in a column-store, BUT:

Need better support for vertical partitioning at the storage layer Need support for column-specific

ptimizations at the executer level
Working with HP Labs to find out

SLIDE 26

Daniel Abadi -- Yale University

Column-Stores vs. Row-Stores: How Different Are They Really?

Daniel Abadi (Yale), Samuel Madden (MIT), Nabil Hachem (AvantGarde Consulting) June 12th, 2008

Row vs. Column-Stores

Row-Store Column-Store

− Might read in

+ Only need to read

+ Easy to add a new

− Tuple writes might require

Column-Stores

warehouses

Lot’s of column scans and aggregations Writes tend to be in batch [CK85], [SAB+05], [ZBN+05], [HLA+06], [SBC+07] all verify this Top 3 in TPC-H rankings (Exasol, ParAccel, and Kickfire) are column-stores

Factor of 5 faster on performance Factor of 2 superior on price/performance

Data Warehouse DBMS Software

DBMS software industry)

Momentum

Infobright, and Exasol new entrants

data warehouse is a column-store (from Mahat Technologies acquisition)

Paper Looks At Key Question

stores just marketing hype?

Do you really need to buy Sybase IQ or Vertica? How far will your current row-store take you?

Can you get column-store performance from a row- store? Can you simulate a column-store in a row-store?

Paper Methodology

dangerous/borderline meaningless

and column-store vs. column-store

Simulate a column-store inside of a row-store Remove column-oriented features from column-store until it behaves like a row-store

Simulate Column-Store Inside Row-Store

1 2 3 1 2 3 1 2 3

Experiments

Fact table contains 17 columns and 60,000,000 rows 4 dimension tables, biggest one has 80,000 rows Queries perform 2-4 joins between fact table and dimension tables, aggregate 1-2 columns from fact table [OOC06]

Original row-store plus 2 column-store simulations on same row-store product

SSBM Averages

What’s Going On?

Tuple Sizes Horizontal Partitioning

Tuple Reconstruction

Tuple Size

1 2 3

1 2 3

1 2 3

columns

(compressed)

Horizontal Partitioning

Year is an element of the ‘Date’ dimension table Most queries in SSBM have a predicate on year Since vertically partitioned tables do not contain the ‘Date’ foreign key, row-store could not similarly partition them

What’s Going On?

Tuple Sizes Horizontal Partitioning

Tuple Construction

Tuple Construction

FROM Facts, Stores WHERE fact.store_id = stores.store_id AND stores.country = “Canada” GROUP BY store_name

Tuple Construction

these TIDs

BUT: index maps value to TID You really want to map TID to value (i.e., a vertical partition) Tuple construction is SLOW

So….

simulate a column-store

NOT fundamental

Store tuple header in a separate partition Allow virtual TIDs Allow HP using a foreign key on a different VP

stores?

Row-Store vs. Column-Store

Row-Store vs. Column-Store

Column-Store Experiments

performance optimizations

query executer

Compression

in column-stores

Better ratio reduced I/O

run-length encoding

Easy to operate on directly for improved performance ([AMF06])

rows first. But:

Poor memory bandwidth utilization Lose opportunity for vectorized operation

Early vs. Late Materialization

Other Column-Store Optimizations

Column-store specific join Optimizations for star schemas Similar to a semi-join

Simplified Version of Results

Conclusion

in a column-store, BUT:

Need better support for vertical partitioning at the storage layer Need support for column-specific

Come Join the Yale DB Group!