SLIDE 1

An Input-Adaptive and In-Place Approach to Dense Tensor-Times-Matrix Multiply

Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, Richard Vuduc

Computational Science & Engineering, Georgia Institute of Technology

19th Nov 2015

SLIDE 2

The problem

Tensor-times-matrix multiply (Ttm): Y = X ×n U.

(Figure: an I × J × K tensor X multiplied on mode 1 by an F × I matrix U, producing the F × J × K tensor Y.)

SLIDE 3

The problem

Ttm: Y = X ×n U. Matricized form (Gemm): Y(n) = U X(n), where X(n) is the mode-n matricization of X (I × JK for mode 1) and U is F × I.

(Figure: Ttm on the tensor X vs. Gemm on the matricized X(n); Transform steps convert between the tensor and its matricized form.)

SLIDE 4

The problem

(Figure repeated from the previous slide.)

Transformation takes about 70% of the running time and about 50% of the space. We propose an in-place Ttm algorithm and employ an auto-tuning method to adapt its parameters.

SLIDE 5

Outline

  • Background
  • Motivation
  • InTensLi Framework
  • Experiments and Analysis
  • Conclusion

SLIDE 6

Background

Tensor and Applications

Tensor: interpreted as a multi-dimensional array, e.g. X ∈ R^(I×J×K).

Special cases: vectors (x) are 1-D tensors, and matrices (A) are 2-D tensors. The number of dimensions (N) is also called the number of modes, or the order. We focus on dense tensors in this work.

Applications

Quantum chemistry, quantum physics, signal and image processing, neuroscience, and data analytics.

(Figure: a third-order (three-dimensional) I × J × K tensor, with indices i = 1, …, I, j = 1, …, J, k = 1, …, K.)

SLIDE 7

Background

Tensor Representations

Sub-tensors:

  • Slices: X(i, :, :) horizontal, X(:, j, :) lateral, X(:, :, k) frontal.
  • Fibers: X(:, j, k) column, X(i, :, k) row, X(i, j, :) tube.
  • The whole tensor itself.

Matricization and tensorization: e.g., a 2 × 2 × 2 tensor X with entries 1, …, 8 unfolds on mode 2 to the J × IK (2 × 4) matrix X(2); tensorization is the inverse.

Different representations → different algorithms → different performance.

SLIDE 8

Background

Memory Mapping

Tensor organization:

  • Logically: a multi-dimensional array.
  • Physically: linear storage.

A mapping function converts a logical multi-index into a physical offset.¹

  • Row-major layout: the leading (fastest-varying) dimension is the last mode (K → J → I for an I × J × K tensor).
  • Column-major layout: the leading dimension is mode 1 (I → J → K).

(Figure: logical and physical views of a 2 × 2 × 2 tensor with entries 1, …, 8 under the two layouts.)
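As a concrete illustration, here is a minimal sketch of such a mapping function for an N-way tensor; the names (shape, idx) are ours, not from the talk:

    #include <cstddef>
    #include <vector>

    // Map a logical multi-index to a physical offset in linear storage.
    // Row-major: the last mode varies fastest; column-major: the first.
    std::size_t offset_row_major(const std::vector<std::size_t>& shape,
                                 const std::vector<std::size_t>& idx) {
        std::size_t off = 0;
        for (std::size_t d = 0; d < shape.size(); ++d)
            off = off * shape[d] + idx[d];
        return off;
    }

    std::size_t offset_col_major(const std::vector<std::size_t>& shape,
                                 const std::vector<std::size_t>& idx) {
        std::size_t off = 0;
        for (std::size_t d = shape.size(); d-- > 0; )
            off = off * shape[d] + idx[d];
        return off;
    }

For the 2 × 2 × 2 example above, offset_col_major({2,2,2}, {i,j,k}) returns i + 2j + 4k, reproducing the column-major ordering in the figure.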

¹ Garcia, R., and Lumsdaine, A. MultiArray: A C++ library for generic programming with arrays. Software: Practice and Experience 35 (2004), 159–188.

SLIDE 9

Background

Tensor Operations

Matricization, a.k.a. unfolding or flattening. Mode-n product, a.k.a. tensor-times-matrix multiply (Ttm).

Ttm on mode 1: Y = X ×n U, computed in matricized form as Y(n) = U X(n).

(Figure: the I × J × K tensor X, the F × I matrix U, and the resulting F × J × K tensor Y; X(1) is I × JK and Y(1) is F × JK.)

Other operations: tensor contraction, Kronecker product, matricized tensor times Khatri-Rao product (Mttkrp), etc.
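Elementwise, the mode-n product has the standard form (cf. Kolda and Bader in the references):

    \left(\mathcal{X} \times_n \mathbf{U}\right)_{i_1 \cdots i_{n-1}\, f\, i_{n+1} \cdots i_N}
      = \sum_{i_n=1}^{I_n} x_{i_1 i_2 \cdots i_N}\, u_{f\, i_n},
      \qquad \mathbf{U} \in \mathbb{R}^{F \times I_n}.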

SLIDE 10

Background

Ttm Algorithm

Baseline Ttm algorithm in Tensor Toolbox and Cyclops Tensor Framework (Ctf).

Pipeline: matricize the input X into X(n) (transformation), multiply Y(n) = U X(n) (Gemm), then tensorize Y(n) back into the output Y (transformation).

Ttm Applications

Low-rank tensor decomposition, e.g. the Tucker decomposition (Tucker-HOOI algorithm):
Y = X ×1 A(1)ᵀ ··· ×n−1 A(n−1)ᵀ ×n+1 A(n+1)ᵀ ··· ×N A(N)ᵀ.
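A minimal sketch of the baseline pipeline for a mode-2 product on a column-major I × J × K tensor, with naive loops standing in for the library Gemm (function and variable names are ours):

    #include <cstddef>
    #include <vector>

    // Baseline Ttm: (1) matricize, (2) Gemm, (3) tensorize.
    // Steps (1) and (3) are the "transformation" profiled in this talk.
    std::vector<double> ttm_mode2_baseline(const std::vector<double>& X,
                                           const std::vector<double>& U, // F x J, column-major
                                           std::size_t I, std::size_t J,
                                           std::size_t K, std::size_t F) {
        // (1) Matricization: X2(j, i + I*k) = X(i, j, k); X2 is J x IK.
        std::vector<double> X2(J * I * K);
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t j = 0; j < J; ++j)
                for (std::size_t i = 0; i < I; ++i)
                    X2[j + J * (i + I * k)] = X[i + I * (j + J * k)];

        // (2) Multiplication: Y2 = U * X2 (F x IK), a plain Gemm.
        std::vector<double> Y2(F * I * K, 0.0);
        for (std::size_t c = 0; c < I * K; ++c)
            for (std::size_t j = 0; j < J; ++j)
                for (std::size_t f = 0; f < F; ++f)
                    Y2[f + F * c] += U[f + F * j] * X2[j + J * c];

        // (3) Tensorization: Y(i, f, k) = Y2(f, i + I*k); Y is I x F x K.
        std::vector<double> Y(I * F * K);
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t f = 0; f < F; ++f)
                for (std::size_t i = 0; i < I; ++i)
                    Y[i + I * (f + F * k)] = Y2[f + F * (i + I * k)];
        return Y;
    }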

SLIDE 11

Background

Main Contributions

  • Proposed an in-place tensor-times-matrix multiply (InTtm) algorithm that avoids physical reorganization of tensors.
  • Built an input-adaptive framework, InTensLi, to automatically adapt parameters and generate the code.
  • Achieved 4× and 13× speedups over the state-of-the-art Tensor Toolbox and Ctf tools.

SLIDE 12

Motivation

Observation 1: Transformation is expensive.

Notation: the number of words moved (Q), floating-point operations (W), and last-level cache size (Z). They are related by the communication lower bound²

    Q ≥ W / (8 √Z) − Z

for both general matrix-matrix multiply (Gemm) and Ttm.

Suppose Ttm does the same number of flops as Gemm (Ŵ = W). When counting the transformation, the arithmetic intensities (AI) of Gemm (A) and Ttm (Â) are related by

    Â ≈ A / (1 + A/m),

so (1 + A/m) is the penalty factor.

Assuming a cache size Z of 8 MB, the penalty for a 3-D tensor is 33. Conclusion: when Ttm and Gemm do the same number of flops, the arithmetic intensity of Ttm is reduced by a penalty factor of 33 or more as the tensor order increases.
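One way to recover the penalty formula, under our reading that m denotes the flops performed per word moved during the transformation:

    \hat{A} = \frac{W}{Q + Q_{\text{transform}}}
            = \frac{W}{W/A + W/m}
            = \frac{A}{1 + A/m},
    \qquad m \triangleq \frac{W}{Q_{\text{transform}}}.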

  • ² G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz. Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numerica, 23:1–155, 2014.
SLIDE 13

Motivation

Observation 2: Performance of the multiplication in Ttm is far below peak.

Ttm algorithm involves a variety of rectangular problem sizes.

(Figure a: Ttm's multiplication. U is m × k with m = F (tiny, e.g. 16) and X(n) is k × n with k = In and n = I1 ··· In−1 In+1 ··· IN, so the Gemm operands are short and fat.
Figure b: Gemm performance (GFLOP/s) of Intel MKL with 4 threads, over log2 k and log2 n ranging from 1 to 14.)

SLIDE 14

Motivation

Observation 3: Ttm organization is critical to data locality.

There are many ways to organize data accesses.

(Figure: multiplying U with the frontal slice X(:, :, k) gives contiguous access; multiplying with the lateral slice X(:, j, :) gives non-contiguous access.)

SLIDE 15

Motivation

Observation 3: Ttm organization is critical to data locality.

Among the many ways to organize data accesses, we choose the slice representation.

Table 1: Different representation forms of mode-1 Ttm on an I × J × K tensor.

  Representation form                       Mode-1 product                                            BLAS level   Transformation
  Tensor representation                     Y = X ×1 U                                                —            —
  Matrix representation (full reorg.)       Y(1) = U X(1)                                             L3           Yes
  Fiber representation (sub-tensor extr.)   y(f, :, k) = X(:, :, k) u(f, :), loops k = 1…K, f = 1…F   L2           No
  Slice representation (sub-tensor extr.)   Y(:, :, k) = U X(:, :, k), loops k = 1…K                  L3           No
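A minimal sketch of the slice form in the last row of Table 1, for a column-major I × J × K tensor using a generic CBLAS interface (each frontal slice is contiguous, so no transformation is needed):

    #include <cblas.h>   // e.g. from OpenBLAS or Intel MKL
    #include <cstddef>

    // Mode-1 Ttm in slice representation: Y(:,:,k) = U * X(:,:,k), k = 1..K.
    // Each slice is one BLAS level-3 call on the data in place.
    void ttm_mode1_slices(const double* X, const double* U, double* Y,
                          int I, int J, int K, int F) {
        for (int k = 0; k < K; ++k) {
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        /*M=*/F, /*N=*/J, /*K=*/I,
                        1.0, U, F,                            // U: F x I
                        X + (std::size_t)k * I * J, I,        // slice X(:,:,k)
                        0.0, Y + (std::size_t)k * F * J, F);  // slice Y(:,:,k)
        }
    }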

SLIDE 16

InTensLi Framework – Algorithmic Strategy

Layout

  1. Background
  2. Motivation
  3. InTensLi Framework (Algorithmic Strategy; InTensLi Framework)
  4. Experiments and Analysis
  5. Conclusion
  6. References

SLIDE 17

InTensLi Framework – Algorithmic Strategy

Algorithmic Strategy

(Figure: modes I1, I2, …, IN around mode n: the n−1 modes before mode n form the backward direction, the N−n modes after it the forward direction; e.g. grouping {I1 I2} and {I3 I4 I5 … IN} forms the sub-tensor Xsub.)

To avoid data copies, two rules: (1) compress only contiguous dimensions; (2) always include the leading dimension. Lemma: Ttm can be performed on up to max{n − 1, N − n} contiguous dimensions without physical reorganization.

SLIDE 18

InTensLi Framework – Algorithmic Strategy

Algorithmic Strategy

(Figure, rules, and lemma repeated from the previous slide.)

To get high Gemm performance: find an appropriate matrix size for the target architecture, using the auto-tuning method in the InTensLi framework.
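A minimal sketch of the resulting in-place kernel under our assumptions (column-major storage, all modes before mode n fused into L = I1···In−1, looping over R = In+1···IN); the paper's full algorithm additionally tunes how many modes go into the Gemm:

    #include <cblas.h>
    #include <cstddef>

    // In-place mode-n Ttm: each block X[., ., r] is a contiguous L x In
    // column-major matrix, so Ysub = Xsub * U^T needs no data copy.
    void inttm_mode_n(const double* X, const double* U, double* Y,
                      std::size_t L, std::size_t In, std::size_t R,
                      std::size_t F) {
        for (std::size_t r = 0; r < R; ++r) {
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                        (int)L, (int)F, (int)In,
                        1.0, X + r * L * In, (int)L,  // Xsub: L x In
                        U, (int)F,                    // U: F x In, used as U^T
                        0.0, Y + r * L * F, (int)L);  // Ysub: L x F
        }
    }

This matches the Ysub = Xsub U' form of the generated code shown later in the framework diagram.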

SLIDE 19

InTensLi Framework – Algorithmic Strategy

InTtm Algorithm and Comparison

InTtm's AI: Ã ≜ Ŵ / Q̂ ≈ 8 √Z ≈ A.

Traditional Ttm's AI: Â ≈ A / (1 + A/m).

InTtm thus eliminates the AI penalty factor (1 + A/m).

(Algorithm listing: the in-place tensor-times-matrix multiply (InTtm) algorithm.)

SLIDE 21

InTensLi Framework

Layout

  1. Background
  2. Motivation
  3. InTensLi Framework (Algorithmic Strategy; InTensLi Framework)
  4. Experiments and Analysis
  5. Conclusion
  6. References

SLIDE 22

InTensLi Framework

Input: tensor features, hardware configuration, and an MM benchmark.

Parameter estimation:
  • Mode partitioning: ML and MC.
  • Thread allocation: PL and PC.

Code generation.

(Figure: the InTensLi framework. Hardware parameters, the maximum number of threads, the MM benchmark, the input tensor, thresholds, mode n, and the data layout feed a parameter estimator, which performs mode partitioning (ML, MC) and thread allocation (PL, PC). A code generator then emits the InTtm code: nested parallel loops (parfor i1 = 1:I1, parfor i2 = 1:I2, …) around a matrix-matrix multiplication, Ysub = U Xsub or Ysub = Xsub U', calling MM libraries such as MKL or BLIS.)

SLIDE 23

InTensLi Framework

Parameter Estimation – Mode Partitioning

Decide the forward/backward strategy:

  • Row-major layout: forward strategy.
  • Column-major layout: backward strategy.

(Figure: a 6-mode tensor I1 … I6; the backward direction groups modes before mode n, the forward direction groups modes after it; e.g. grouping {I3 I4 I5 I6} forms the sub-tensor Xsub.)

SLIDE 24

InTensLi Framework

Parameter Estimation – Mode Partitioning

With the forward strategy chosen, the group size determines the InTtm algorithm.

(Figure: on a 6-mode tensor, group sizes 3 and 2 yield sub-tensors X1sub and X2sub with different splits of the modes into loop modes ML and Gemm modes MC.)

SLIDE 25

InTensLi Framework

Choosing Group Size

(Figure: Gemm performance (GFLOP/s) vs. log2 n for the MM benchmark multiplying 16 × 512 by 512 × n matrices with 4 threads, with an 80%-of-best band and the thresholds MSTH and MLTH marked; candidate sub-tensors X1sub, X2sub, X3sub fuse the trailing mode groups {I4 I5 I6}, {I5 I6}, and {I6}. The chosen partition is MC = {I5, I6}, ML = {I1, I2, I4}.)

MSTH and MLTH: thresholds on the Gemm size, measured as the combined size of all three operand matrices. MSTH = 1.04 MB and MLTH = 7.04 MB in our experiments.

SLIDE 26

InTensLi Framework

Choosing Group Size

(Figure and thresholds repeated from the previous slide.)

Decide MC: use MSTH and MLTH to choose the group size, which determines MC. Decide ML: the remaining modes outside MC, excluding mode n.
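A sketch of how the group size might be chosen from these thresholds; the selection loop is our guess at the talk's procedure, with only the threshold values taken from the slide:

    #include <cstddef>
    #include <vector>

    // Grow the Gemm-mode group MC with contiguous trailing modes while the
    // combined size of the three Gemm operands stays below ML_TH, stopping
    // once it reaches MS_TH (big enough for good Gemm performance).
    std::size_t choose_group_size(const std::vector<std::size_t>& dims,
                                  std::size_t n,  // product mode, 0-based
                                  std::size_t F,
                                  double ms_th = 1.04e6, double ml_th = 7.04e6) {
        std::size_t In = dims[n], group = 0, cols = 1;
        for (std::size_t d = dims.size(); d-- > n + 1; ) {
            double bytes = 8.0 * (F * In + (In + F) * cols); // U, Xsub, Ysub
            if (bytes >= ms_th) break;          // already large enough
            std::size_t next = cols * dims[d];
            if (8.0 * (F * In + (In + F) * next) > ml_th) break; // too large
            cols = next;
            ++group;
        }
        return group;   // number of trailing modes fused into MC
    }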

SLIDE 27

InTensLi Framework

Thread Allocation and Code Generation

Thread allocation

In most cases, maximum performance is obtained by one of only two configurations:

  • Small matrices: all threads are allocated to the nested loops.
  • Large matrices: all threads are allocated to the Gemm operation.

A threshold PTH (800 KB in our tests) distinguishes the two Gemm size regimes.

Code generation

Generate the nested loops and wrappers for the Gemm kernel. The code is generated in C++, using OpenMP with the collapse directive.
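A sketch of the shape of the generated code under our assumptions (two loop modes I1, I2 collapsed into one parallel loop, Gemm over the fused modes in the body); the identifiers are ours:

    #include <cblas.h>
    #include <cstddef>

    // Generated-code skeleton: OpenMP 'collapse' parallelizes the loop
    // modes ML; each iteration runs one Gemm over the fused modes MC.
    void generated_inttm(const double* X, const double* U, double* Y,
                         int I1, int I2, int L, int In, int F) {
        #pragma omp parallel for collapse(2)
        for (int i1 = 0; i1 < I1; ++i1)
            for (int i2 = 0; i2 < I2; ++i2) {
                const double* Xsub = X + ((std::size_t)i1 * I2 + i2) * L * In;
                double*       Ysub = Y + ((std::size_t)i1 * I2 + i2) * L * F;
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                            L, F, In, 1.0, Xsub, L, U, F, 0.0, Ysub, L);
            }
    }

Allocating all threads to the loops (small Gemm) corresponds to running each Gemm single-threaded inside this parallel region; for large Gemm, the outer pragma would be dropped and the BLAS library's own threading used instead.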

SLIDE 28

Experiments and Analysis

Experimental Platforms

All experiments use double precision. We employ 8 and 32 threads on the two platforms, respectively, to account for hyper-threading. The Xeon E7-4820 machine has a relatively large memory (512 GiB), allowing us to test a larger range of (dense) tensor sizes than has been common in prior single-node studies.

Table 2: Experimental platform configuration.

  Parameters             Intel Core i7-4770K   Intel Xeon E7-4820
  Microarchitecture      Haswell               Westmere
  Frequency              3.5 GHz               2.0 GHz
  # of physical cores    4                     16
  Hyper-threading        On                    On
  Peak GFLOP/s           224                   128
  Last-level cache       8 MiB                 18 MiB
  Memory size            32 GiB                512 GiB
  Memory bandwidth       25.6 GB/s             34.2 GB/s
  # of memory channels   2                     4
  Compiler               icc 15.0.2            icc 15.0.0

SLIDE 29

Experiments and Analysis

Performance Comparison

Implementations

  • InTtm: InTensLi-generated C++ code with OpenMP.
  • TT-TTM: Tensor Toolbox library in MATLAB.
  • Ctf: C++ code, supporting MPI+OpenMP parallelization.
  • Gemm: C++ code, the baseline Ttm algorithm without transformation.

Speedup

We obtain 4× and 13× speedups over Tensor Toolbox and Ctf, respectively, and get close to the Gemm-only performance.

(Figure: performance (GFLOP/s) of Gemm, Ctf, TT-TTM, and InTtm for mode-2 Ttm on tensors of sizes 1000³ (3-D), 180⁴ (4-D), and 60⁵ (5-D).)

SLIDE 30

Experiments and Analysis

Analysis

Performance of different modes.

InTensLi is stable for different mode-n products, while Tensor Toolbox is not.

(Figure: performance (GFLOP/s) of InTtm vs. Tensor Toolbox (TT-TTM) for the mode-1 through mode-4 products on a 160 × 160 × 160 × 160 tensor.)

SLIDE 31

Experiments and Analysis

Analysis

Parameter selection

Compared with exhaustive search, InTensLi's predicted configurations achieve close-to-optimal performance.

(Figure: Ttm performance (GFLOP/s) on mode-1 with the predicted configuration vs. the actual best configuration, on 5th-order tensors.)

SLIDE 32

Conclusion

Summary:

  • Proposed an in-place tensor-times-matrix multiply (InTtm) algorithm that avoids physical reorganization of tensors.
  • Built an input-adaptive framework, InTensLi, to automatically optimize and generate the code.
  • Achieved 4× and 13× speedups over the state-of-the-art Tensor Toolbox and Ctf tools.

Future work:

  • Integrate InTtm into Tucker and other tensor decompositions.
  • Explore a similar strategy for sparse tensors.

Source code: https://github.com/hpcgarage/InTensLi.

SLIDE 33

Conclusion

Backup Slides

SLIDE 34

References

  • E. Solomonik, D. Matthews, J. Hammond, and J. Demmel. Cyclops Tensor Framework: Reducing communication and eliminating load imbalance in massively parallel contractions. Technical Report UCB/EECS-2012-210, EECS Department, University of California, Berkeley, Nov 2012.
  • B. W. Bader, T. G. Kolda, et al. MATLAB Tensor Toolbox version 2.5. Available from http://www.sandia.gov/~tgkolda/TensorToolbox/index-2.6.html, January 2012.
  • T. Kolda and B. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
  • …

SLIDE 35


Observation 1: Transformation is expensive.

Transformation takes about 70% of the total run-time, and close to 50% of the total storage.

(Figure a: normalized time, split into Multiply and Transform. Figure b: normalized space, likewise. Both for tensor sizes 1000³, 200⁴, and 60⁵.)

Profiling of the Ttm algorithm for the mode-2 product on 3rd-, 4th-, and 5th-order tensors, where the output tensors are low-rank representations of the corresponding input tensors.
