Column-Stores vs. Row-Stores: How Different Are They Really? Daniel - - PowerPoint PPT Presentation

column stores vs row stores how different are they really
SMART_READER_LITE
LIVE PREVIEW

Column-Stores vs. Row-Stores: How Different Are They Really? Daniel - - PowerPoint PPT Presentation

Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale), Samuel Madden (MIT), Nabil Hachem (AvantGarde Consulting) June 12 th , 2008 Daniel Abadi -- Yale University Row vs. Column-Stores Row-Store Column-Store Last


slide-1
SLIDE 1

Daniel Abadi -- Yale University

Column-Stores vs. Row-Stores: How Different Are They Really?

Daniel Abadi (Yale), Samuel Madden (MIT), Nabil Hachem (AvantGarde Consulting) June 12th, 2008

slide-2
SLIDE 2

Daniel Abadi -- Yale University

Row vs. Column-Stores

Street Address Phone # E-mail First Name Last Name Last Name First Name E-mail Phone # Street Address

Row-Store Column-Store

− Might read in

unnecessary data

+ Only need to read

in relevant data

+ Easy to add a new

record

− Tuple writes might require

multiple seeks

slide-3
SLIDE 3

Daniel Abadi -- Yale University

Column-Stores

  • Really good for read-mostly data

warehouses

Lot’s of column scans and aggregations Writes tend to be in batch [CK85], [SAB+05], [ZBN+05], [HLA+06], [SBC+07] all verify this Top 3 in TPC-H rankings (Exasol, ParAccel, and Kickfire) are column-stores

Factor of 5 faster on performance Factor of 2 superior on price/performance

slide-4
SLIDE 4

Daniel Abadi -- Yale University

Data Warehouse DBMS Software

  • $4.5 billion industry (out of total $16 billion

DBMS software industry)

  • Growing 10% annually
slide-5
SLIDE 5

Daniel Abadi -- Yale University

Momentum

  • Right solution for growing market $$$$
  • Vertica, ParAccel, Kickfire, Calpont,

Infobright, and Exasol new entrants

  • Sybase IQ’s profits rapidly increasing
  • Yahoo’s world largest (multi-petabyte)

data warehouse is a column-store (from Mahat Technologies acquisition)

slide-6
SLIDE 6

Daniel Abadi -- Yale University

Paper Looks At Key Question

  • How much of the buzz around column-

stores just marketing hype?

Do you really need to buy Sybase IQ or Vertica? How far will your current row-store take you?

Can you get column-store performance from a row- store? Can you simulate a column-store in a row-store?

slide-7
SLIDE 7

Daniel Abadi -- Yale University

Paper Methodology

  • Comparing row-store vs. column-store is

dangerous/borderline meaningless

  • Instead, compare row-store vs. row-store

and column-store vs. column-store

Simulate a column-store inside of a row-store Remove column-oriented features from column-store until it behaves like a row-store

slide-8
SLIDE 8

Daniel Abadi -- Yale University

Simulate Column-Store Inside Row-Store

Street Address Phone # E-mail First Name Last Name Last Name First Name E-mail

1 2 3 1 2 3 1 2 3

Option A: Vertical Partitioning … Option B: Index Every Column

Last Name Index First Name Index

slide-9
SLIDE 9

Daniel Abadi -- Yale University

Experiments

  • Star Schema Benchmark (SSBM)

Fact table contains 17 columns and 60,000,000 rows 4 dimension tables, biggest one has 80,000 rows Queries perform 2-4 joins between fact table and dimension tables, aggregate 1-2 columns from fact table [OOC06]

  • Implemented by professional DBA

Original row-store plus 2 column-store simulations on same row-store product

slide-10
SLIDE 10

Daniel Abadi -- Yale University

SSBM Averages

  • !"
  • #$

%&

slide-11
SLIDE 11

Daniel Abadi -- Yale University

What’s Going On?

  • Vertically Partitioned Case

Tuple Sizes Horizontal Partitioning

  • All Indexes Case

Tuple Reconstruction

slide-12
SLIDE 12

Daniel Abadi -- Yale University

Tuple Size

1 2 3

Column Data TID

1 2 3

TID Column Data

1 2 3

TID Column Data Tuple Header

  • Queries touch 3-4 foreign keys in fact table, 1-2 numeric

columns

  • Complete fact table takes up ~4 GB (compressed)
  • Vertically partitioned tables take up 0.7-1.1 GB

(compressed)

slide-13
SLIDE 13

Daniel Abadi -- Yale University

Horizontal Partitioning

  • Fact table horizontally partitioned on year

Year is an element of the ‘Date’ dimension table Most queries in SSBM have a predicate on year Since vertically partitioned tables do not contain the ‘Date’ foreign key, row-store could not similarly partition them

slide-14
SLIDE 14

Daniel Abadi -- Yale University

What’s Going On?

  • Vertically Partitioned Case

Tuple Sizes Horizontal Partitioning

  • All Indexes Case

Tuple Construction

slide-15
SLIDE 15

Daniel Abadi -- Yale University

Tuple Construction

  • Common type of query:
  • SELECT store_name, SUM(revenue)

FROM Facts, Stores WHERE fact.store_id = stores.store_id AND stores.country = “Canada” GROUP BY store_name

slide-16
SLIDE 16

Daniel Abadi -- Yale University

Tuple Construction

  • Result of lower part of query plan is a set
  • f TIDs that passed all predicates
  • Need to extract SELECT attributes at

these TIDs

BUT: index maps value to TID You really want to map TID to value (i.e., a vertical partition) Tuple construction is SLOW

slide-17
SLIDE 17

Daniel Abadi -- Yale University

So….

  • All indexes approach is a poor way to

simulate a column-store

  • Problems with vertical partitioning are

NOT fundamental

Store tuple header in a separate partition Allow virtual TIDs Allow HP using a foreign key on a different VP

  • So can row-stores simulate column-

stores?

slide-18
SLIDE 18

Daniel Abadi -- Yale University

Row-Store vs. Column-Store

  • '
  • Average

25.7 1 1 .7 4.4 Row-Store Row-Store (M V) C-Store

slide-19
SLIDE 19

Daniel Abadi -- Yale University

Row-Store vs. Column-Store

  • '
  • Average

25.7 1 1 .7 4.4 Row-Store Row-Store (M V) C-Store

slide-20
SLIDE 20

Daniel Abadi -- Yale University

Column-Store Experiments

  • Start with column-store (C-Store)
  • Remove column-store-specific

performance optimizations

  • End with column-store with a row-oriented

query executer

slide-21
SLIDE 21

Daniel Abadi -- Yale University

Compression

  • Higher data value locality

in column-stores

Better ratio reduced I/O

  • Can use schemes like

run-length encoding

Easy to operate on directly for improved performance ([AMF06])

Q1 Q1 Q1 Q1 Q1 Q1 Q1 Q2 Q2 Q2 Q2 … … Quarter

(Q1, 1, 300)

Quarter

(Q2, 301, 350) (Q3, 651, 500) (Q4, 1151, 600)

slide-22
SLIDE 22

Daniel Abadi -- Yale University

  • Early Materialization: create

rows first. But:

Poor memory bandwidth utilization Lose opportunity for vectorized operation

2 1 3 1 2 3 3 3 7 13 42 80 Construct 2 3 3 3 7 13 42 80 Select + Aggregate 2 1 3 1 4 4 4 4

prodID storeIDcustID price

QUERY: SELECT custID,SUM(price) FROM table WHERE (prodID = 4) AND (storeID = 1) AND GROUP BY custID

Early vs. Late Materialization

4 4 4 4

slide-23
SLIDE 23

Daniel Abadi -- Yale University

Other Column-Store Optimizations

  • Invisible join

Column-store specific join Optimizations for star schemas Similar to a semi-join

  • Block Processing
slide-24
SLIDE 24

Daniel Abadi -- Yale University

Simplified Version of Results

  • '

(

  • Average

4.4 14.9 40.7 Original C-St ore C-Store, No Compression C-St ore, Early Mat erializat ion

slide-25
SLIDE 25

Daniel Abadi -- Yale University

Conclusion

  • Might be possible to simulate a row-store

in a column-store, BUT:

Need better support for vertical partitioning at the storage layer Need support for column-specific

  • ptimizations at the executer level
  • Working with HP Labs to find out
slide-26
SLIDE 26

Daniel Abadi -- Yale University

Come Join the Yale DB Group!