BY THEIR FRUITS SHALL YE KNOW THEM
A DATA ANALYST’S PERSPECTIVE ON MASSIVELY PARALLEL SYSTEM DESIGN
Holger Pirk, Sam Madden, Mike Stonebraker
A CRUCIAL DISTINCTION INSPIRATION MY PLEDGE OF LOYALTY SCIENTIFIC RATIONALE GENE AMDAHL
Figure: Processed Instructions per Second (1T, 1P, 1E) against Processed Bytes per Instruction (1, 10, 100); reference line: Processing 500GB/s; vendors plotted: AMD, Nvidia, Intel
Diagram: Memory, SIMT Cores, Instruction Scheduler
Diagram: Memory, SIMD Cores (Pentium Cores, 512 Bits), Scatter/Gather Unit
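The "Processing 500GB/s" line in the figure can be read as a constant-throughput curve: instruction rate times bytes per instruction equals 500 GB/s. That reading is an assumption, not something the slide states outright. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def required_instruction_rate(bandwidth_bytes_per_s, bytes_per_instruction):
    """Instructions per second needed to consume the given bandwidth.

    Assumes throughput = instruction rate * bytes per instruction, which is
    one reading of the "Processing 500GB/s" line in the figure.
    """
    return bandwidth_bytes_per_s / bytes_per_instruction


BANDWIDTH = 500e9  # 500 GB/s, the figure's reference line

if __name__ == "__main__":
    for bytes_per_instruction in (1, 10, 100):  # the figure's axis ticks
        rate = required_instruction_rate(BANDWIDTH, bytes_per_instruction)
        print(f"{bytes_per_instruction:>3} B/instr -> {rate:.1e} instr/s")
```

Under this reading, the fewer bytes each instruction processes, the more instructions per second a device needs to keep up with the same data rate.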
Facts Dimension
Ɣ
Bandwidth, Computation, Synchronization, Capacity
Tuple Width, # of Conflicts, Hash Complexity, Access Locality
Bandwidth
Figure: Time per Access in ns (0.04 to 1.28) vs. Stride in Bytes (4 to 512); series: GTX 780, Xeon Phi
A Non-Linear Cost Function
Not Dominated (only) by Cache Misses
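For comparison, here is a sketch of the naive model that counts only cache-line fetches. The 64-byte line size, the function names, and the per-miss cost are illustrative assumptions, not values from the slides. The model predicts a cost that rises linearly with the stride up to the line size and stays flat beyond it; the two slide titles above indicate that the measured curves are not fully explained this way.

```python
def lines_per_access(stride_bytes, line_bytes=64):
    """Average number of new cache lines touched per access in a strided scan.

    line_bytes=64 is an assumption; the line size depends on the device.
    """
    return min(stride_bytes, line_bytes) / line_bytes


def naive_time_per_access(stride_bytes, miss_cost_ns, line_bytes=64):
    """Cost per access if fetching cache lines were the only cost."""
    return lines_per_access(stride_bytes, line_bytes) * miss_cost_ns


if __name__ == "__main__":
    # miss_cost_ns=1.0 is a placeholder, not a measured value.
    for stride in (4, 8, 16, 32, 64, 128, 256, 512):
        print(stride, naive_time_per_access(stride, miss_cost_ns=1.0))
```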
Capacity
Figure: Time per Access in ns (0.02 to 1.28) vs. Size of Lookup Table in Bytes (64 to 16M); series: GTX 780, Xeon Phi, Xeon Phi Lower Bound, GTX 780 Lower Bound
Computation
Figure: Time per hash in ns (0.05 to 0.80) vs. Number of Murmur Rehashes (1 to 32); series: Xeon Phi, GTX 780
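The slides do not show the hash function itself. As a rough illustration of what "Number of Murmur Rehashes" could mean, the sketch below applies the 32-bit MurmurHash3 finalizer repeatedly. That the benchmark uses this particular Murmur step is an assumption, and `rehash` is a hypothetical name.

```python
MASK32 = 0xFFFFFFFF


def fmix32(h):
    """MurmurHash3 32-bit finalizer (avalanche step)."""
    h &= MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def rehash(key, rounds):
    """Apply the finalizer `rounds` times to scale the compute cost per key."""
    h = key
    for _ in range(rounds):
        h = fmix32(h)
    return h


if __name__ == "__main__":
    for rounds in (1, 2, 4, 8, 16, 32):  # the figure's x-axis values
        print(rounds, hex(rehash(42, rounds)))
```

Raising `rounds` increases the arithmetic per tuple without changing the memory traffic, which is what makes it usable as a knob for the computation dimension.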
Synchronization
Figure: Time per Access in ns (0.0 to 15.0) vs. Number of Values per Bucket (1 to 31); series: GTX 780, Xeon Phi
CHAPTER 6. INSTRUCTION DESCRIPTIONS
VGATHERDPD - Gather Float64 Vector With Signed Dword Indices
Opcode: MVEX.512.66.0F38.W1 92 /r /vsib
Instruction: vgatherdpd zmm1 {k1}, Uf64(mvt)
Description: Gather float64 vector Uf64(mvt) into float64 vector zmm1 using doubleword indices and k1 as completion mask.
Description
A set of 8 memory locations pointed by base address BASE_ADDR and doubleword index vector VINDEX with scale SCALE are converted to a float64 vector. The result is written into float64 vector zmm1.
Note the special mask behavior as only a subset of the active elements of write mask k1 are actually operated on (as denoted by function SELECT_SUBSET). There are only two guarantees about the function: (a) the destination mask is a subset of the source mask (identity is included), and (b) on a given invocation of the instruction, at least one element (the least significant enabled mask bit) will be selected from the source mask.
Programmers should always enforce the execution of a gather/scatter instruction to be re-executed (via a loop) until the full completion of the sequence (i.e. all elements of the gather/scatter sequence have been loaded/stored and hence, the write-mask bits all are zero).
Note that each accessed element will always access 64 bytes of memory. The memory region accessed by each element will always be between element_linear_address & (~0x3F) and (element_linear_address & (~0x3F)) + 63 boundaries.
This instruction has special disp8*N and alignment rules. N is considered to be the size
Note also the special mask behavior as the corresponding bits in write mask k1 are reset with each destination element being updated according to the subset of write mask k1. This is useful to allow conditional re-trigger of the instruction until all the elements from a given write mask have been successfully loaded.
The instruction will #GP fault if the destination vector zmm1 is the same as index vector VINDEX.
Operation
Reference Number: 327364-001, page 297
??? !☠
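The retry loop that the manual excerpt asks for can be simulated in a few lines. This is a sketch of the documented semantics only, not of how the hardware chooses elements: `select_subset` is a hypothetical stand-in for SELECT_SUBSET that honors just the two stated guarantees, and indices are treated as element offsets rather than scaled byte addresses.

```python
import random

LANES = 8  # vgatherdpd loads 8 float64 elements


def select_subset(mask, rng):
    """Hypothetical SELECT_SUBSET: a subset of `mask` that always contains
    the least significant enabled bit (the manual's only two guarantees)."""
    lowest = mask & -mask               # least significant set bit
    extra = rng.getrandbits(LANES) & mask
    return lowest | extra


def gather_step(dest, mask, table, indices, rng):
    """One invocation: load the selected lanes and clear their mask bits."""
    done = select_subset(mask, rng)
    for lane in range(LANES):
        if (done >> lane) & 1:
            dest[lane] = table[indices[lane]]
    return mask & ~done


def gather(table, indices, mask=0xFF, seed=0):
    """Re-execute the gather until the completion mask is all zero."""
    rng = random.Random(seed)
    dest = [None] * LANES
    invocations = 0
    while mask:
        mask = gather_step(dest, mask, table, indices, rng)
        invocations += 1
    return dest, invocations


if __name__ == "__main__":
    table = [i * 1.5 for i in range(100)]
    values, invocations = gather(table, [3, 1, 4, 1, 5, 9, 2, 6])
    print(values, invocations)
```

Because each invocation is only guaranteed to complete one element, the loop may run anywhere between one and eight times for a full mask.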
Figure: Time per Access in ns (0.06 to 4.00) vs. Size of Lookup Table in Bytes (8 to 16M); series: Scalar, Vectorized, Ratio
Figure: Time per Access in ns (0.03 to 4.00) vs. Stride in Bytes (4 to 512); series: Scalar, Vectorized, Ratio
Figure: y-axis ticks 0.03 to 2.00 vs. Stride in Bytes (4 to 4K); series: With Prefetcher, Bypassing Prefetcher; annotations: Expected Behavior, Overhead
Figure: Cache-Overhead Adjusted Transfer Rate in GB/s (20 to 320) vs. Stride in Bytes (4 to 4K)
ONLY WHEN FACTORING IN TRANSFER OVERHEAD IS THE NOMINAL PHI BANDWIDTH ACHIEVED
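One plausible reading of "Cache-Overhead Adjusted Transfer Rate" is that it counts every byte the memory system moves, including the unused parts of each cache line, rather than only the bytes the program consumes. The sketch below follows that reading. The 64-byte line size, the function names, and the sample timing are assumptions, not values from the slides.

```python
def adjusted_transfer_rate_gbs(stride_bytes, time_per_access_ns, line_bytes=64):
    """Transfer rate counting whole cache lines moved per access.

    For strides below the line size, one line is shared by several accesses,
    so the amortized traffic per access is `stride_bytes`.
    """
    bytes_moved = min(stride_bytes, line_bytes)
    return bytes_moved / time_per_access_ns  # bytes per ns equals GB/s


def useful_transfer_rate_gbs(value_bytes, time_per_access_ns):
    """Transfer rate counting only the bytes the program actually reads."""
    return value_bytes / time_per_access_ns


if __name__ == "__main__":
    # Hypothetical measurement: 0.5 ns per 4-byte access at a 256-byte stride.
    print(useful_transfer_rate_gbs(4, 0.5))       # consumed bytes only
    print(adjusted_transfer_rate_gbs(256, 0.5))   # whole lines moved
```

With large strides the two rates differ by the ratio of line size to value size, which is why a device can appear far below its nominal bandwidth unless the transfer overhead is factored in.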