NUMA obliviousness through memory mapping
Mrunal Gawade, Martin Kersten
CWI, Amsterdam
DaMoN 2015 (1st June 2015), Melbourne, Australia
NUMA architecture Intel Xeon E5-4657L v2 @2.40GHz
Memory mapping
What is it? The operating system maps disk files to memory, e.g. executable file mapping.
How is it done? Through the system calls mmap() and munmap().
Relevance for the database world? In-memory columnar storage: disk files mapped to memory.
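The idea above can be sketched in a few lines of Python (an illustrative stand-in for MonetDB's C-level storage layer; the file name and values are made up):

```python
import mmap
import os
import struct
import tempfile

# Write a small binary "column" file of 32-bit integers
# (a made-up stand-in for a columnar storage file on disk).
path = os.path.join(tempfile.mkdtemp(), "l_quantity.col")
values = [7, 42, 25000000, 3]
with open(path, "wb") as f:
    f.write(struct.pack(f"<{len(values)}i", *values))

# Map the file into the process address space with mmap(); pages are
# faulted in lazily by the OS, landing on whichever NUMA node the
# faulting thread runs on (Linux's first-touch policy).
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    col = struct.unpack(f"<{len(values)}i", mm[:])
    mm.close()

print(col)  # (7, 42, 25000000, 3)
```

The first-touch behavior noted in the comment is what makes thread placement matter for a memory-mapped store.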
Motivation: memory-mapped columnar storage and NUMA effects in an analytical workload.
TPC-H Q1 … (4 sockets, 100GB, MonetDB)
[Bar chart: execution time (sec, 0-35) vs. the sockets on which memory is allocated, invoked as numactl -N 0,1 -m "varied between sockets 0-3" "Database server process".]
Contributions NUMA oblivious (shared-everything) is relatively good compared to NUMA aware (shared-nothing). (using SQL workload) Effect of memory mapping on NUMA obliviousness insights. (using micro-benchmarks) Distributed database system using multi-sockets (shared- nothing) reduces remote memory accesses.
NUMA oblivious vs NUMA aware plans
NUMA_Obliv (shared everything): default parallel plans in MonetDB; "Lineitem" and "Orders" tables sliced.
NUMA_Shard (variation of NUMA_Obliv): shard-aware plans in MonetDB; "Lineitem" and "Orders" tables sharded in 4 pieces (on orderkey) and sliced.
NUMA_Distr (shared nothing): socket-aware plans in MonetDB; only the "Lineitem" table is sharded in 4 pieces (on orderkey) and sliced; dimension tables replicated.
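The sharding in these plans can be illustrated with a minimal sketch (modulo partitioning on orderkey is an assumption here; the point is that orderkey-based co-partitioning keeps joins shard-local):

```python
# Sketch: sharding a table in 4 pieces on the orderkey column, as in
# the NUMA_Shard / NUMA_Distr configurations. Modulo partitioning is
# an assumption; any deterministic orderkey-based scheme co-partitions
# "lineitem" and "orders" so their join stays local to a shard.
NUM_SHARDS = 4

def shard_of(orderkey: int) -> int:
    return orderkey % NUM_SHARDS

# Toy (orderkey, l_quantity) rows standing in for "lineitem".
lineitem = [(1, 17), (2, 36), (5, 8), (6, 28), (7, 38)]

shards = [[] for _ in range(NUM_SHARDS)]
for row in lineitem:
    shards[shard_of(row[0])].append(row)

print(shards)
```

With a shard per socket, each socket scans only its local piece, which is what NUMA_Distr exploits to cut remote accesses.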
System configuration
Intel Xeon E5-4657L v2 @2.40GHz, 4 sockets, 12 cores per socket (96 threads in total with hyper-threading)
Cache: L1 = 32KB, L2 = 256KB, shared L3 = 30MB
1TB four-channel DDR3 memory (256GB per socket)
OS: Fedora 20
Data set: TPC-H 100GB
Tools: numactl, Intel PCM, Linux perf
MonetDB: open-source system with memory-mapped columnar storage
TPC-H performance
[Bar chart: execution time (sec, 0-6) of NUMA_Obliv, NUMA_Shard, and NUMA_Distr on TPC-H queries 4, 6, 15, and 19.]
NUMA_Shard is a variation of NUMA_Obliv with a sharded & partitioned "orders" table.
Micro-experiments on modified Q6
Why Q6? - select count(*) from lineitem where l_quantity > 24000000;
A selection on the "lineitem" table; easily parallelizable; NUMA effects are easy to analyze (read-only query).
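Why this query parallelizes easily can be sketched as follows (toy data; each thread scans one slice of the column and the partial counts are summed, which mirrors MonetDB's sliced parallel plans only loosely):

```python
from concurrent.futures import ThreadPoolExecutor

# Embarrassingly parallel selection, as in the modified Q6:
# count(*) where l_quantity > 24000000. Toy column of 48 values.
l_quantity = list(range(0, 48_000_000, 1_000_000))
THRESHOLD = 24_000_000

def count_slice(chunk):
    # Each worker scans only its own slice of the column.
    return sum(1 for v in chunk if v > THRESHOLD)

n_threads = 4
step = len(l_quantity) // n_threads
slices = [l_quantity[i * step:(i + 1) * step] for i in range(n_threads)]

with ThreadPoolExecutor(max_workers=n_threads) as pool:
    total = sum(pool.map(count_slice, slices))

print(total)  # 23 values exceed the threshold
```

Because there is no join or aggregation state shared between slices, the only cross-socket traffic comes from where the column's pages happen to reside, which is exactly what the micro-experiments measure.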
Process and memory affinity
Example: numactl -C 0-11,12-23,24-35 -m 0,1,2 "Database Server"
Socket 0: cores 0-11, hyper-threads 48-59
Socket 1: cores 12-23, hyper-threads 60-71
Socket 2: cores 24-35, hyper-threads 72-83
Socket 3: cores 36-47, hyper-threads 84-95
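numactl fixes affinity at process start; the same CPU pinning can also be done at run time, as in this Linux-only sketch using sched_setaffinity (memory affinity, the -m part of numactl, has no portable stdlib equivalent and would need libnuma/mbind, so it is omitted):

```python
import os

# Linux-only: query and change the CPU affinity mask of this process.
# On the slide's machine, pinning to CPUs 0-11 would confine the
# process to socket 0; here we just pin to one currently allowed CPU.
allowed = os.sched_getaffinity(0)    # CPUs we may run on now
target = {min(allowed)}              # e.g. {0}: a single core
os.sched_setaffinity(0, target)      # pin this process
assert os.sched_getaffinity(0) == target
os.sched_setaffinity(0, allowed)     # restore the original mask
```

Pinning threads this way is what "PMA = yes" stands for in the following micro-experiments.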
Micro-experiments on Q6 (NUMA_Obliv)
Local vs remote memory access
[Three line charts: memory accesses in millions (0-80) vs. number of threads (12-96), plotting local and remote memory accesses for PMA=yes/BCC=yes, PMA=no/BCC=yes, and PMA=no/BCC=no.]
Process and memory affinity = PMA
Buffer cache cleared = BCC (echo 3 | sudo /usr/bin/tee /proc/sys/vm/drop_caches)
Execution time (robustness)
[Three charts: execution time in milliseconds (0-350) vs. number of threads (12-96) for PMA=yes/BCC=yes (most robust), PMA=no/BCC=yes (less robust), and PMA=no/BCC=no (least robust).]
Process and memory affinity = PMA
Buffer cache cleared = BCC (echo 3 | sudo /usr/bin/tee /proc/sys/vm/drop_caches)
Increase in threads = more remote accesses?
Distribution of mapped pages
[Stacked bar chart: proportion of mapped pages (%) on sockets 0-3 vs. number of threads (12, 24, 36, 48), measured via /proc/<process id>/numa_maps.]
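The per-socket page counts come from lines like the one below; this sketch parses the N<node>=<pages> fields of a numa_maps entry (the sample line is fabricated but follows the documented format):

```python
import re

# /proc/<pid>/numa_maps reports, per mapping, how many pages live on
# each NUMA node as "N<node>=<pages>" fields. Sample line is made up.
sample = ("7f2a4c000000 default file=/db/lineitem.col mapped=120 "
          "N0=70 N1=30 N2=15 N3=5 kernelpagesize_kB=4")

def pages_per_node(line: str) -> dict:
    """Return {node: pages} for one numa_maps line."""
    return {int(n): int(p) for n, p in re.findall(r"\bN(\d+)=(\d+)", line)}

counts = pages_per_node(sample)
total = sum(counts.values())
share = {node: pages / total for node, pages in counts.items()}
print(counts)   # {0: 70, 1: 30, 2: 15, 3: 5}
```

Summing these shares over all mappings of the server process gives the per-socket proportions plotted in the chart.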
# CPU migrations
[Bar chart: number of CPU migrations (0-200) vs. number of threads (12-96).]
Why are remote accesses bad?
[Bar chart: execution time in milliseconds (0-160) of NUMA_Obliv vs. NUMA_Distr on modified TPC-H Q6.]
NUMA_Obliv: 69 million (M) local accesses, 136 M remote accesses.
NUMA_Distr: 196 M local accesses, 9 M remote accesses.
NUMA_Distr to minimize remote accesses ?
Comparison with Vectorwise
[Bar chart: execution time (sec, 0-6) of MonetDB NUMA_Shard, MonetDB NUMA_Distr, Vector_Def, and Vector_Distr on TPC-H queries 4, 6, 15, and 19.]
Vectorwise has no NUMA awareness and also uses a dedicated buffer manager.
Comparison with Hyper
[Bar chart: execution time (sec, 0-3.5) of MonetDB NUMA_Distr and Hyper on several TPC-H queries (including 4, 6, 9, 12, 14, 15, and 19), with speed-ups of 1.15, 2.3, 5.7, and 2.5 marked in red.]
The red numbers indicate the speed-up of Hyper over MonetDB NUMA_Distr plans. Hyper generates NUMA-aware, LLVM JIT-compiled, fused operator pipeline plans.
Conclusion
● NUMA obliviousness fares reasonably well compared to NUMA awareness.
● Process and memory affinity helps NUMA oblivious plans perform robustly.
● A simple distributed, shared-nothing database configuration can compete with state-of-the-art databases.
Thank you