SLIDE 10 Institute for Computational Mechanics Martin Kronbichler Technische Universität München
Characteristics of element kernels
Operator evaluation: 30–70% of arithmetic peak
◮ Vectorization over elements
◮ data layout: array-of-struct-of-array
◮ intrinsics via C++ template class VectorizedArray<double>
◮ everything but vector read/write fully
vectorized (DG: fully vectorized)
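Vectorization over elements with an array-of-struct-of-array layout can be sketched as follows. This is a minimal illustration, not the library's implementation: `Lanes`, `gather_batch` and `scale_batch` are hypothetical names, and `Lanes` uses a plain array in place of the intrinsics that a class such as VectorizedArray<double> wraps. Each degree of freedom of a batch is stored once per SIMD lane, and the indexed read from the global vector is the step that is not a plain vector load.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a SIMD wrapper class: one entry ("lane") per
// element in a batch. A plain array is used here instead of intrinsics.
template <typename Number, std::size_t n_lanes>
struct Lanes
{
  std::array<Number, n_lanes> data{};

  // Lane-wise product: the same operation is applied to all elements.
  friend Lanes operator*(const Lanes &a, const Lanes &b)
  {
    Lanes result;
    for (std::size_t v = 0; v < n_lanes; ++v)
      result.data[v] = a.data[v] * b.data[v];
    return result;
  }
};

// Gather the dofs of n_lanes elements from a global vector into
// array-of-struct-of-array form: local[i] holds dof i of every element in
// the batch. dof_indices[i][v] is the global index of dof i on element v.
template <typename Number, std::size_t n_lanes>
std::vector<Lanes<Number, n_lanes>>
gather_batch(const std::vector<Number> &global,
             const std::vector<std::array<std::size_t, n_lanes>> &dof_indices)
{
  std::vector<Lanes<Number, n_lanes>> local(dof_indices.size());
  for (std::size_t i = 0; i < dof_indices.size(); ++i)
    for (std::size_t v = 0; v < n_lanes; ++v)
      local[i].data[v] = global[dof_indices[i][v]];
  return local;
}

// Element-local work: multiply every dof by a per-element coefficient.
// One pass over the dofs serves all elements of the batch at once.
template <typename Number, std::size_t n_lanes>
void scale_batch(std::vector<Lanes<Number, n_lanes>> &local,
                 const Lanes<Number, n_lanes> &coefficient)
{
  for (auto &dof : local)
    dof = dof * coefficient;
}
```

With two lanes, `gather_batch` places dof 0 of both elements next to each other in memory, so the arithmetic in `scale_batch` touches contiguous data regardless of how the elements are numbered in the global vector.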
◮ Loop bounds known at compile time
◮ element degree and number of 1D
quadrature points as templates
◮ unroll loops of length ⌊p/2⌋ + 1 or p + 1
◮ Challenge: Mix of operations
◮ memory intensive: vector read/write,
pre-evaluated Jacobians, coefficients, etc.
◮ arithmetic intensive: sum factorization
→ Develop performance model!
. . .
vmovapd 0x988(%rsp),%ymm4
vmovapd 0xbc8(%rsp),%ymm6
vmovapd 0x9a8(%rsp),%ymm14
vsubpd %ymm4,%ymm6,%ymm11
vmulpd %ymm15,%ymm6,%ymm3
vmulpd %ymm6,%ymm13,%ymm8
vfmadd231pd %ymm13,%ymm4,%ymm3
vfmadd231pd %ymm4,%ymm15,%ymm8
vmulpd %ymm6,%ymm12,%ymm10
vmovapd 0xbe8(%rsp),%ymm0
vmovapd 0x268(%rsp),%ymm9
vfmsub231pd %ymm9,%ymm4,%ymm10
vmulpd %ymm9,%ymm6,%ymm9
vmovapd 0xaa8(%rsp),%ymm2
vfmsub231pd %ymm4,%ymm12,%ymm9
vmovapd %ymm11,0x748(%rsp)
vmovapd 0xac8(%rsp),%ymm1
vmulpd %ymm15,%ymm0,%ymm5
vmulpd %ymm0,%ymm13,%ymm4
vfmadd231pd %ymm7,%ymm2,%ymm3
vfmadd231pd %ymm7,%ymm2,%ymm8
vfmadd231pd %ymm13,%ymm14,%ymm5
vmovapd %ymm3,0x2c8(%rsp)
vsubpd %ymm14,%ymm0,%ymm3
. . .
ExaDG: Operator evaluation framework Parallel adaptive multigrid Applications Summary