
SLIDE 1

BY THEIR FRUITS SHALL YE KNOW THEM

A DATA ANALYST’S PERSPECTIVE ON MASSIVELY PARALLEL SYSTEM DESIGN

Holger Pirk Sam Madden Mike Stonebraker

SLIDE 2

SLIDE 3

A CRUCIAL DISTINCTION

≠

SLIDE 4

INSPIRATION

SLIDE 5

MY PLEDGE OF LOYALTY

SLIDE 6

SCIENTIFIC RATIONALE

SLIDE 7

Figure: Processed Instructions per Second (ticks 1T, 1P, 1E) against Processed Bytes per Instruction (ticks 1, 10, 100), annotated "Processing 500GB/s".

GENE AMDAHL TAUGHT US THAT SYSTEMS NEED TO BE BALANCED
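
The balance argument can be made concrete with a little arithmetic: at a fixed memory bandwidth, the instruction rate that memory can keep fed falls in proportion to the bytes each instruction consumes. The sketch below is illustrative only. The function name is hypothetical, and the 500 GB/s figure is simply the value annotated on the slide.

```python
def balanced_instruction_rate(bandwidth_bytes_per_s: float,
                              bytes_per_instruction: float) -> float:
    """Instructions per second that a given memory bandwidth can sustain."""
    if bytes_per_instruction <= 0:
        raise ValueError("bytes_per_instruction must be positive")
    return bandwidth_bytes_per_s / bytes_per_instruction


BANDWIDTH = 500e9  # 500 GB/s, the value annotated on the slide

# The same bytes-per-instruction ticks as on the chart axis.
for bpi in (1, 10, 100):
    rate = balanced_instruction_rate(BANDWIDTH, bpi)
    print(f"{bpi:>3} bytes/instruction -> {rate:.1e} instructions/s")
```

A device whose peak instruction rate lies far above this line for its typical bytes per instruction is compute-rich and bandwidth-starved.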

SLIDE 8

Figure: Processed Instructions per Second (ticks 1T, 1P, 1E) against Processed Bytes per Instruction (ticks 1, 10, 100), annotated "Processing 500GB/s"; markers: AMD, Nvidia.

NVIDIA AND AMD PROCESS LOTS OF SMALL DATA WORDS

SLIDE 9

Memory

SIMT Cores

Instruction Scheduler

SIMT

SLIDE 10

Figure: Processed Instructions per Second (ticks 1T, 1P, 1E) against Processed Bytes per Instruction (ticks 1, 10, 100), annotated "Processing 500GB/s"; markers: AMD, Nvidia, Intel.

INTEL PROCESSES FEWER LARGE DATA WORDS

SLIDE 11

Diagram labels: Memory; SIMD Core (x10)

MANY CORE SIMD

Pentium Cores 512 Bits

SLIDE 12

Diagram labels: Memory; SIMD Core (x7); Scatter/Gather Unit

SIMD WITH SCATTER/GATHER

SLIDE 13

Figure: Processed Instructions per Second (ticks 1T, 1P, 1E) against Processed Bytes per Instruction (ticks 1, 10, 100), annotated "Processing 500GB/s"; markers: AMD, Nvidia, Intel.

ALL OF THEM CAN PROCESS WAY MORE DATA THAN THEY CAN LOAD

SLIDE 14

SPEC BANDWIDTH-WISE, PHI OUTPERFORMS CURRENT GPUS

Figure: Memory Bandwidth in GB/s (ticks 100, 200, 300, 400) for Phi and GTX 780.

SLIDE 15

Figure: Processed Instructions per Second (ticks 1T, 1P, 1E) against Processed Bytes per Instruction (ticks 1, 10, 100), annotated "Processing 500GB/s"; markers: AMD, Nvidia, Intel.

OUR QUESTION: DOES IT MATTER? DOES PHI CHANGE ANYTHING?

SLIDE 16

THE OBSTACLE COURSE

SLIDE 17

Facts Dimension

π

Ɣ

DATA-CENTRIC APPLICATIONS HAVE TYPICAL CHOKEPOINTS

Bandwidth Computation Synchronization Capacity

SLIDE 18

DATA-CENTRIC APPLICATIONS HAVE TYPICAL CHOKEPOINTS

Facts Dimension

π

Ɣ

Tuple Width # of Conflicts Hash Complexity Access Locality

SLIDE 19

PHI VS. GTX 780

SLIDE 20

Facts Dimension

π

Ɣ

Bandwidth

FIRST CHOKEPOINT

SLIDE 21

Figure: Time per Access in ns (ticks 0.04 to 1.28) against Stride in Bytes (4 to 512); series: GTX 780, Xeon Phi.

BANDWIDTH OF PHI LOOKS SIMILAR TO GPU AT FIRST GLANCE

SLIDE 22

Figure: Time per Access in ns (ticks 0.04 to 1.28) against Stride in Bytes (4 to 512); series: GTX 780, Xeon Phi.

A SECOND GLANCE REVEALS SOMETHING ODD…

A Non-Linear Cost Function

SLIDE 23

Figure: Time per Access in ns (ticks 0.04 to 1.28) against Stride in Bytes (4 to 512); series: GTX 780, Xeon Phi.

A SECOND GLANCE REVEALS SOMETHING ODD…

Not Dominated (only) by Cache Misses
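
One way to read this claim is to ask what the curve would look like if cache-line transfers were the only cost. The sketch below is a minimal model, not the measurement code behind the plot. It assumes 64-byte cache lines and aligned values, and `lines_per_access` is a hypothetical helper name.

```python
LINE_SIZE = 64  # bytes per cache line (assumption)


def lines_per_access(stride: int, line_size: int = LINE_SIZE) -> float:
    """Average number of new cache lines fetched per access at a given stride.

    Below the line size, several consecutive accesses share one line.
    At or above it, every access pays for one full line.
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    return min(stride, line_size) / line_size


# The same strides as on the plot's x-axis.
for stride in (4, 8, 16, 32, 64, 128, 256, 512):
    print(f"stride {stride:>3} B -> {lines_per_access(stride):.4f} lines/access")
```

Under this model the time per access doubles with each doubling of the stride up to 64 bytes and then stays flat. A measured curve that departs from that shape points to a cost other than cache misses.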

SLIDE 24

Facts Dimension

π

Ɣ

Capacity

SECOND CHOKEPOINT

SLIDE 25

Figure: Time per Access in ns (ticks 0.02 to 1.28) against Size of Lookup Table in Bytes (64 to 16M); series: GTX 780, Xeon Phi, Xeon Phi Lower Bound, GTX 780 Lower Bound.

PHI BENEFITS FROM LARGER CACHES
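
The structure of such a capacity experiment can be sketched as follows: build a lookup table of a given size in bytes, then read it at random positions so that the access pattern has no locality beyond the table size. This is an illustrative reconstruction, not the code behind the plot. All function names are hypothetical, and timing it in Python would say nothing about either device.

```python
import random
from array import array


def make_table(size_bytes: int) -> array:
    """Lookup table of roughly size_bytes; entry i holds the value i."""
    itemsize = array("i").itemsize
    entries = max(1, size_bytes // itemsize)
    return array("i", range(entries))


def random_indices(entries: int, n_lookups: int, seed: int = 0) -> list:
    """Reproducible, uniformly random positions into a table with `entries` slots."""
    rng = random.Random(seed)
    return [rng.randrange(entries) for _ in range(n_lookups)]


def lookup_sum(table: array, indices: list) -> int:
    """Perform the lookups; the checksum keeps the reads from being optimized away."""
    total = 0
    for i in indices:
        total += table[i]
    return total


for size in (64, 4096, 256 * 1024):
    table = make_table(size)
    idx = random_indices(len(table), 10_000)
    print(size, lookup_sum(table, idx))
```

Once the table outgrows a cache level, each random lookup is likely to miss, which is where a larger cache pays off.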

SLIDE 26

Facts Dimension

π

Ɣ

Computation

THIRD CHOKEPOINT

SLIDE 27

Figure: Time per hash in ns (ticks 0.05 to 0.80) against Number of Murmur Rehashes (1 to 32); series: Xeon Phi, GTX 780.

COMPUTATION PERFORMANCE IS VERY SIMILAR…
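
To make "number of Murmur rehashes" concrete, here is a sketch that applies a Murmur mixing step repeatedly to a key, so that the work per value grows while the memory traffic stays constant. The slides do not say which Murmur variant was used. The 32-bit finalizer of MurmurHash3 (`fmix32`) is used here as an assumption, and `rehash` is a hypothetical helper.

```python
MASK32 = 0xFFFFFFFF


def fmix32(h: int) -> int:
    """32-bit finalizer of MurmurHash3 (assumed variant, see lead-in)."""
    h &= MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def rehash(key: int, rounds: int) -> int:
    """Apply the mixing step `rounds` times; more rounds means more computation per key."""
    h = key
    for _ in range(rounds):
        h = fmix32(h)
    return h


# The same round counts as on the plot's x-axis.
for rounds in (1, 2, 4, 8, 16, 32):
    print(rounds, hex(rehash(42, rounds)))
```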

SLIDE 28

Facts Dimension

π

Ɣ

Synchronization

FOURTH CHOKEPOINT

SLIDE 29

Figure: Time per Access in ns (ticks 0.0 to 15.0) against Number of Values per Bucket (1 to 31); series: GTX 780, Xeon Phi.

…AND SO IS HASH-BUILDING
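
The hash-building experiment varies how many values land in the same bucket, which on parallel hardware determines how often threads must synchronize on one bucket. The sequential sketch below only shows how such a workload can be constructed. The function names are hypothetical, and the real experiment would replace the `append` with an atomic operation on the device.

```python
def keys_with_conflicts(n_buckets: int, values_per_bucket: int) -> list:
    """Keys chosen so that `key % n_buckets` sends exactly values_per_bucket keys to each bucket."""
    return list(range(n_buckets * values_per_bucket))


def build_hash_table(keys: list, n_buckets: int) -> list:
    """Chained hash table; the append is the step that needs synchronization in parallel code."""
    buckets = [[] for _ in range(n_buckets)]
    for key in keys:
        buckets[key % n_buckets].append(key)
    return buckets


table = build_hash_table(keys_with_conflicts(8, 5), 8)
print([len(bucket) for bucket in table])
```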

SLIDE 30

RECAP

Phi & GPU mostly on par in
Computation
Synchronization
Cache-Utilization
But what is up with the memory access?

SLIDE 31

PHI IN DEPTH

SLIDE 32

SCATTER/GATHER

SLIDE 33

CHAPTER 6. INSTRUCTION DESCRIPTIONS

VGATHERDPD - Gather Float64 Vector With Signed Dword Indices

Opcode: MVEX.512.66.0F38.W1 92 /r /vsib
Instruction: vgatherdpd zmm1 {k1}, Uf64(mvt)
Description: Gather float64 vector Uf64(mvt) into float64 vector zmm1 using doubleword indices and k1 as completion mask.

Description

A set of 8 memory locations pointed by base address BASE_ADDR and doubleword index vector V INDEX with scale SCALE are converted to a float64 vector. The result is written into float64 vector zmm1.

Note the special mask behavior as only a subset of the active elements of write mask k1 are actually operated on (as denoted by function SELECT_SUBSET). There are only two guarantees about the function: (a) the destination mask is a subset of the source mask (identity is included), and (b) on a given invocation of the instruction, at least one element (the least significant enabled mask bit) will be selected from the source mask.

Programmers should always enforce the execution of a gather/scatter instruction to be re-executed (via a loop) until the full completion of the sequence (i.e. all elements of the gather/scatter sequence have been loaded/stored and hence, the write-mask bits all are zero).

Note that accessed element by will always access 64 bytes of memory. The memory region accessed by each element will always be between element_linear_address & (~0x3F) and (element_linear_address & (~0x3F)) + 63 boundaries.

This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element before up-conversion.

Note also the special mask behavior as the corresponding bits in write mask k1 are reset with each destination element being updated according to the subset of write mask k1. This is useful to allow conditional re-trigger of the instruction until all the elements from a given write mask have been successfully loaded.

The instruction will #GP fault if the destination vector zmm1 is the same as index vector V INDEX.

Operation

Reference Number: 327364-001

297

LET’S LOOK AT THE DOCUMENTATION
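
The retry loop that the manual asks for can be emulated in a few lines. The manual only guarantees that each invocation handles a subset of the active mask bits that includes the least significant one. The policy used below (serve every active lane that falls on the same 64-byte line as the lowest active lane) is an assumption chosen for illustration, not documented behavior. `select_subset` and `gather` are hypothetical names, and `memory` is a plain list of doubles addressed from byte offset 0.

```python
LINE_SIZE = 64  # bytes touched per element access, per the manual text
SCALE = 8       # bytes per float64 element


def select_subset(mask: int, addresses: list) -> int:
    """Assumed policy: lanes sharing a cache line with the lowest active lane."""
    lowest = (mask & -mask).bit_length() - 1
    line = addresses[lowest] & ~(LINE_SIZE - 1)
    subset = 0
    for lane, addr in enumerate(addresses):
        if (mask >> lane) & 1 and (addr & ~(LINE_SIZE - 1)) == line:
            subset |= 1 << lane
    return subset


def gather(memory: list, indices: list):
    """Re-issue the emulated gather until the write mask is empty.

    Returns the gathered values and the number of invocations needed.
    """
    addresses = [i * SCALE for i in indices]
    mask = (1 << len(indices)) - 1  # all lanes active at the start
    dest = [None] * len(indices)
    invocations = 0
    while mask:
        done = select_subset(mask, addresses)
        for lane, addr in enumerate(addresses):
            if (done >> lane) & 1:
                dest[lane] = memory[addr // SCALE]
        mask &= ~done  # completed lanes have their mask bits reset
        invocations += 1
    return dest, invocations


memory = [float(i) for i in range(64)]
print(gather(memory, [0, 1, 2, 3, 4, 5, 6, 7]))        # all lanes on one line
print(gather(memory, [0, 8, 16, 24, 32, 40, 48, 56]))  # every lane on its own line
```

Under this assumed policy, eight contiguous doubles complete in one invocation, while eight doubles on eight different lines need eight. That would be consistent with a gather that speeds up lookups only moderately when accesses are scattered.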

SLIDE 34

Manual page for VGATHERDPD, repeated from slide 33. In this copy the extracted text omits BASE_ADDR, V INDEX, SCALE and SELECT_SUBSET.

LET’S LOOK AT THE DOCUMENTATION

???

SLIDE 35

Manual page for VGATHERDPD, repeated from slide 33. In this copy the extracted text omits the operand Uf64(mvt).

LET’S LOOK AT THE DOCUMENTATION

??? !☠"

SLIDE 36

Figure: Time per Access in ns (ticks 0.06 to 4.00) against Size of Lookup Table in Bytes (8 to 16M); series: Scalar, Vectorized, Ratio.

GATHER-LOADING ONLY YIELDS MODERATE LOOKUP IMPROVEMENT…

SLIDE 37

Figure: Time per Access in ns (ticks 0.03 to 4.00) against Stride in Bytes (4 to 512); series: Scalar, Vectorized, Ratio.

…SAME FOR PROJECTIONS

SLIDE 38

PREFETCHING

SLIDE 39

THE PHI PREFETCHER SEEMS OVERLY AGGRESSIVE

Figure: y-axis ticks 0.03 to 2.00 (axis label not preserved) against Stride in Bytes (4 to 4K); labels: With Prefetcher, Bypassing Prefetcher, Expected Behavior, Overhead.

SLIDE 40

Figure: Cache-Overhead Adjusted Transfer Rate in GB/s (ticks 20 to 320) against Stride in Bytes (4 to 4K).

ONLY WHEN FACTORING IN TRANSFER OVERHEAD IS THE NOMINAL PHI BANDWIDTH ACHIEVED
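
The slides do not spell out how the adjustment is computed. One plausible reading, sketched below as an assumption, is to count every byte the cache hierarchy has to move for an access (a whole line once the stride reaches the line size) rather than only the bytes the program uses. The function name, the 4-byte value size and the 64-byte line size are all hypothetical, as are the input timings in the usage lines.

```python
def adjusted_transfer_rate_gbps(time_per_access_ns: float,
                                stride: int,
                                value_size: int = 4,
                                line_size: int = 64) -> float:
    """Transfer rate in GB/s, counting all bytes moved per access.

    For strides below the line size, consecutive accesses share a line, so
    each access accounts for `stride` bytes of traffic (at least value_size).
    From the line size upward, each access moves one full line.
    """
    if time_per_access_ns <= 0:
        raise ValueError("time_per_access_ns must be positive")
    bytes_moved = min(max(stride, value_size), line_size)
    return bytes_moved / time_per_access_ns  # bytes per ns equals GB/s


# Hypothetical timings, for illustration only (not measurements from the slides).
print(adjusted_transfer_rate_gbps(0.125, 4))   # 32.0
print(adjusted_transfer_rate_gbps(0.5, 64))    # 128.0
print(adjusted_transfer_rate_gbps(0.5, 4096))  # 128.0
```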

SLIDE 41

TAKEAWAY

Phi is on par with mid-level GPUs for compute-intensive applications
Data-intensive performance is weird, though:
Prefetcher seems overly aggressive
Gather implementation seems half-baked: too few cache ports?

SLIDE 42


THANK YOU