NUMA obliviousness through memory mapping
Mrunal Gawade, Martin Kersten
CWI, Amsterdam
DaMoN 2015 (1st June 2015), Melbourne, Australia
NUMA architecture Intel Xeon E5-4657L v2 @2.40GHz
Memory mapping
What is it? The operating system maps disk files to memory, e.g. executable file mapping.
How is it done? Through the system calls mmap() and munmap().
Relevance for the database world? In-memory columnar storage: disk files mapped to memory.
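The idea above can be sketched in a few lines of Python (an illustrative stand-in for MonetDB's C-level storage layer; the file name and values are made up):

```python
import mmap
import os
import struct
import tempfile

# Write a small binary "column" file of 32-bit integers
# (a made-up stand-in for a columnar storage file on disk).
path = os.path.join(tempfile.mkdtemp(), "l_quantity.col")
values = [7, 42, 25000000, 3]
with open(path, "wb") as f:
    f.write(struct.pack(f"<{len(values)}i", *values))

# Map the file into the process address space with mmap(); pages are
# faulted in lazily by the OS, landing on whichever NUMA node the
# faulting thread runs on (Linux's first-touch policy).
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    col = struct.unpack(f"<{len(values)}i", mm[:])
    mm.close()

print(col)  # (7, 42, 25000000, 3)
```

The first-touch behavior noted in the comment is what makes thread placement matter for a memory-mapped store.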
Motivation: memory-mapped columnar storage and NUMA effects in an analytical workload.
TPC-H Q1 … (4 sockets, 100GB, MonetDB)
[Bar chart: execution time (sec, 0-35) vs. the sockets on which memory is allocated, invoked as numactl -N 0,1 -m "varied between sockets 0-3" "Database server process".]
Contributions NUMA oblivious (shared-everything) is relatively good compared to NUMA aware (shared-nothing). (using SQL workload) Effect of memory mapping on NUMA obliviousness insights. (using micro-benchmarks) Distributed database system using multi-sockets (shared- nothing) reduces remote memory accesses.
NUMA oblivious vs NUMA aware plans
NUMA_Obliv (shared everything): default parallel plans in MonetDB; "Lineitem" and "Orders" tables sliced.
NUMA_Shard (variation of NUMA_Obliv): shard-aware plans in MonetDB; "Lineitem" and "Orders" tables sharded in 4 pieces (on orderkey) and sliced.
NUMA_Distr (shared nothing): socket-aware plans in MonetDB; only the "Lineitem" table is sharded in 4 pieces (on orderkey) and sliced; dimension tables replicated.
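The sharding in these plans can be illustrated with a minimal sketch (modulo partitioning on orderkey is an assumption here; the point is that orderkey-based co-partitioning keeps joins shard-local):

```python
# Sketch: sharding a table in 4 pieces on the orderkey column, as in
# the NUMA_Shard / NUMA_Distr configurations. Modulo partitioning is
# an assumption; any deterministic orderkey-based scheme co-partitions
# "lineitem" and "orders" so their join stays local to a shard.
NUM_SHARDS = 4

def shard_of(orderkey: int) -> int:
    return orderkey % NUM_SHARDS

# Toy (orderkey, l_quantity) rows standing in for "lineitem".
lineitem = [(1, 17), (2, 36), (5, 8), (6, 28), (7, 38)]

shards = [[] for _ in range(NUM_SHARDS)]
for row in lineitem:
    shards[shard_of(row[0])].append(row)

print(shards)
```

With a shard per socket, each socket scans only its local piece, which is what NUMA_Distr exploits to cut remote accesses.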
System configuration
Intel Xeon E5-4657L v2 @2.40GHz, 4 sockets, 12 cores per socket (96 threads in total with hyper-threading)
Cache: L1 = 32KB, L2 = 256KB, shared L3 = 30MB
1TB four-channel DDR3 memory (256GB per socket)
OS: Fedora 20
Data set: TPC-H 100GB
Tools: numactl, Intel PCM, Linux perf
MonetDB: open-source system with memory-mapped columnar storage
TPC-H performance
[Bar chart: execution time (sec, 0-6) of NUMA_Obliv, NUMA_Shard, and NUMA_Distr on TPC-H queries 4, 6, 15, and 19.]
NUMA_Shard is a variation of NUMA_Obliv with a sharded & partitioned "orders" table.
Micro-experiments on modified Q6
Why Q6? - select count(*) from lineitem where l_quantity > 24000000;
A selection on the "lineitem" table; easily parallelizable; NUMA effects are easy to analyze (read-only query).
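Why this query parallelizes easily can be sketched as follows (toy data; each thread scans one slice of the column and the partial counts are summed, which mirrors MonetDB's sliced parallel plans only loosely):

```python
from concurrent.futures import ThreadPoolExecutor

# Embarrassingly parallel selection, as in the modified Q6:
# count(*) where l_quantity > 24000000. Toy column of 48 values.
l_quantity = list(range(0, 48_000_000, 1_000_000))
THRESHOLD = 24_000_000

def count_slice(chunk):
    # Each worker scans only its own slice of the column.
    return sum(1 for v in chunk if v > THRESHOLD)

n_threads = 4
step = len(l_quantity) // n_threads
slices = [l_quantity[i * step:(i + 1) * step] for i in range(n_threads)]

with ThreadPoolExecutor(max_workers=n_threads) as pool:
    total = sum(pool.map(count_slice, slices))

print(total)  # 23 values exceed the threshold
```

Because there is no join or aggregation state shared between slices, the only cross-socket traffic comes from where the column's pages happen to reside, which is exactly what the micro-experiments measure.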
Process and memory affinity
Example: numactl -C 0-11,12-23,24-35 -m 0,1,2 "Database Server"
Socket 0: cores 0-11, hyper-threads 48-59
Socket 1: cores 12-23, hyper-threads 60-71
Socket 2: cores 24-35, hyper-threads 72-83
Socket 3: cores 36-47, hyper-threads 84-95
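numactl fixes affinity at process start; the same CPU pinning can also be done at run time, as in this Linux-only sketch using sched_setaffinity (memory affinity, the -m part of numactl, has no portable stdlib equivalent and would need libnuma/mbind, so it is omitted):

```python
import os

# Linux-only: query and change the CPU affinity mask of this process.
# On the slide's machine, pinning to CPUs 0-11 would confine the
# process to socket 0; here we just pin to one currently allowed CPU.
allowed = os.sched_getaffinity(0)    # CPUs we may run on now
target = {min(allowed)}              # e.g. {0}: a single core
os.sched_setaffinity(0, target)      # pin this process
assert os.sched_getaffinity(0) == target
os.sched_setaffinity(0, allowed)     # restore the original mask
```

Pinning threads this way is what "PMA = yes" stands for in the following micro-experiments.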
Micro-experiments on Q6 (NUMA_Obliv)
Local vs remote memory access
[Three line charts: memory accesses in millions (0-80) vs. number of threads (12-96), plotting local and remote memory accesses for PMA=yes/BCC=yes, PMA=no/BCC=yes, and PMA=no/BCC=no.]
Process and memory affinity = PMA
Buffer cache cleared = BCC (echo 3 | sudo /usr/bin/tee /proc/sys/vm/drop_caches)
Execution time (robustness)
[Three charts: execution time in milliseconds (0-350) vs. number of threads (12-96) for PMA=yes/BCC=yes (most robust), PMA=no/BCC=yes (less robust), and PMA=no/BCC=no (least robust).]
Process and memory affinity = PMA
Buffer cache cleared = BCC (echo 3 | sudo /usr/bin/tee /proc/sys/vm/drop_caches)
Increase in threads = more remote accesses?
Distribution of mapped pages
[Stacked bar chart: proportion of mapped pages (%) on sockets 0-3 vs. number of threads (12, 24, 36, 48), measured via /proc/<process id>/numa_maps.]
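The per-socket page counts come from lines like the one below; this sketch parses the N<node>=<pages> fields of a numa_maps entry (the sample line is fabricated but follows the documented format):

```python
import re

# /proc/<pid>/numa_maps reports, per mapping, how many pages live on
# each NUMA node as "N<node>=<pages>" fields. Sample line is made up.
sample = ("7f2a4c000000 default file=/db/lineitem.col mapped=120 "
          "N0=70 N1=30 N2=15 N3=5 kernelpagesize_kB=4")

def pages_per_node(line: str) -> dict:
    """Return {node: pages} for one numa_maps line."""
    return {int(n): int(p) for n, p in re.findall(r"\bN(\d+)=(\d+)", line)}

counts = pages_per_node(sample)
total = sum(counts.values())
share = {node: pages / total for node, pages in counts.items()}
print(counts)   # {0: 70, 1: 30, 2: 15, 3: 5}
```

Summing these shares over all mappings of the server process gives the per-socket proportions plotted in the chart.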
# CPU migrations
[Bar chart: number of CPU migrations (0-200) vs. number of threads (12-96).]
Why are remote accesses bad?
[Bar chart: execution time in milliseconds (0-160) of NUMA_Obliv vs. NUMA_Distr on modified TPC-H Q6.]
NUMA_Obliv: 69 million (M) local accesses, 136 M remote accesses.
NUMA_Distr: 196 M local accesses, 9 M remote accesses.
NUMA_Distr to minimize remote accesses ?
Comparison with Vectorwise
[Bar chart: execution time (sec, 0-6) of MonetDB NUMA_Shard, MonetDB NUMA_Distr, Vector_Def, and Vector_Distr on TPC-H queries 4, 6, 15, and 19.]
Vectorwise has no NUMA awareness and also uses a dedicated buffer manager.
Comparison with Hyper
[Bar chart: execution time (sec, 0-3.5) of MonetDB NUMA_Distr and Hyper on several TPC-H queries (including 4, 6, 9, 12, 14, 15, and 19), with speed-ups of 1.15, 2.3, 5.7, and 2.5 marked in red.]
The red numbers indicate the speed-up of Hyper over MonetDB NUMA_Distr plans. Hyper generates NUMA-aware, LLVM JIT-compiled, fused operator pipeline plans.
Conclusion
● NUMA obliviousness fares reasonably well compared to NUMA awareness.
● Process and memory affinity helps NUMA oblivious plans perform robustly.
● A simple distributed, shared-nothing database configuration can compete with state-of-the-art databases.
Thank you