SLIDE 1

An Input-Adaptive and In-Place Approach to Dense Tensor-Times-Matrix Multiply

Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, Richard Vuduc

Computational Science & Engineering, Georgia Institute of Technology

19th Nov 2015

SLIDE 2

The problem

Tensor-times-matrix multiply (Ttm): Y = X ×n U.

(Figure: an I × J × K tensor X multiplied on mode 1 by an F × I matrix U, producing the F × J × K tensor Y.)

SLIDE 3

The problem

Ttm: Y = X ×n U. Matricized form (Gemm): Y(n) = U X(n), where X(n) is the mode-n matricization of X (I × JK for mode 1) and U is F × I.

(Figure: Ttm on the tensor X vs. Gemm on the matricized X(n); Transform steps convert between the tensor and its matricized form.)

SLIDE 4

The problem

(Figure repeated from the previous slide.)

Transformation takes about 70% of the running time and about 50% of the space. We propose an in-place Ttm algorithm and employ an auto-tuning method to adapt its parameters.

SLIDE 5

Outline

  • Background
  • Motivation
  • InTensLi Framework
  • Experiments and Analysis
  • Conclusion

SLIDE 6

Background

Tensor and Applications

Tensor: interpreted as a multi-dimensional array, e.g. X ∈ R^(I×J×K).

Special cases: vectors (x) are 1-D tensors, and matrices (A) are 2-D tensors. The number of dimensions (N) is also called the number of modes, or the order. We focus on dense tensors in this work.

Applications

Quantum chemistry, quantum physics, signal and image processing, neuroscience, and data analytics.

(Figure: a third-order (three-dimensional) I × J × K tensor, with indices i = 1, …, I, j = 1, …, J, k = 1, …, K.)

SLIDE 7

Background

Tensor Representations

Sub-tensors:

  • Slices: X(i, :, :) horizontal, X(:, j, :) lateral, X(:, :, k) frontal.
  • Fibers: X(:, j, k) column, X(i, :, k) row, X(i, j, :) tube.
  • The whole tensor itself.

Matricization and tensorization: e.g., a 2 × 2 × 2 tensor X with entries 1, …, 8 unfolds on mode 2 to the J × IK (2 × 4) matrix X(2); tensorization is the inverse.

Different representations → different algorithms → different performance.

SLIDE 8

Background

Memory Mapping

Tensor organization:

  • Logically: a multi-dimensional array.
  • Physically: linear storage.

A mapping function converts a logical multi-index into a physical offset.¹

  • Row-major layout: the leading (fastest-varying) dimension is the last mode (K → J → I for an I × J × K tensor).
  • Column-major layout: the leading dimension is mode 1 (I → J → K).

(Figure: logical and physical views of a 2 × 2 × 2 tensor with entries 1, …, 8 under the two layouts.)
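As a concrete illustration, here is a minimal sketch of such a mapping function for an N-way tensor; the names (shape, idx) are ours, not from the talk:

    #include <cstddef>
    #include <vector>

    // Map a logical multi-index to a physical offset in linear storage.
    // Row-major: the last mode varies fastest; column-major: the first.
    std::size_t offset_row_major(const std::vector<std::size_t>& shape,
                                 const std::vector<std::size_t>& idx) {
        std::size_t off = 0;
        for (std::size_t d = 0; d < shape.size(); ++d)
            off = off * shape[d] + idx[d];
        return off;
    }

    std::size_t offset_col_major(const std::vector<std::size_t>& shape,
                                 const std::vector<std::size_t>& idx) {
        std::size_t off = 0;
        for (std::size_t d = shape.size(); d-- > 0; )
            off = off * shape[d] + idx[d];
        return off;
    }

For the 2 × 2 × 2 example above, offset_col_major({2,2,2}, {i,j,k}) returns i + 2j + 4k, reproducing the column-major ordering in the figure.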

¹ Garcia, R., and Lumsdaine, A. MultiArray: A C++ library for generic programming with arrays. Software: Practice and Experience 35 (2004), 159–188.

SLIDE 9

Background

Tensor Operations

Matricization, a.k.a. unfolding or flattening. Mode-n product, a.k.a. tensor-times-matrix multiply (Ttm).

Ttm on mode 1: Y = X ×n U, computed in matricized form as Y(n) = U X(n).

(Figure: the I × J × K tensor X, the F × I matrix U, and the resulting F × J × K tensor Y; X(1) is I × JK and Y(1) is F × JK.)

Other operations: tensor contraction, Kronecker product, matricized tensor times Khatri-Rao product (Mttkrp), etc.
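Elementwise, the mode-n product has the standard form (cf. Kolda and Bader in the references):

    \left(\mathcal{X} \times_n \mathbf{U}\right)_{i_1 \cdots i_{n-1}\, f\, i_{n+1} \cdots i_N}
      = \sum_{i_n=1}^{I_n} x_{i_1 i_2 \cdots i_N}\, u_{f\, i_n},
      \qquad \mathbf{U} \in \mathbb{R}^{F \times I_n}.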

SLIDE 10

Background

Ttm Algorithm

Baseline Ttm algorithm in Tensor Toolbox and Cyclops Tensor Framework (Ctf).

Pipeline: matricize the input X into X(n) (transformation), multiply Y(n) = U X(n) (Gemm), then tensorize Y(n) back into the output Y (transformation).

Ttm Applications

Low-rank tensor decomposition, e.g. the Tucker decomposition (Tucker-HOOI algorithm):
Y = X ×1 A(1)ᵀ ··· ×n−1 A(n−1)ᵀ ×n+1 A(n+1)ᵀ ··· ×N A(N)ᵀ.
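A minimal sketch of the baseline pipeline for a mode-2 product on a column-major I × J × K tensor, with naive loops standing in for the library Gemm (function and variable names are ours):

    #include <cstddef>
    #include <vector>

    // Baseline Ttm: (1) matricize, (2) Gemm, (3) tensorize.
    // Steps (1) and (3) are the "transformation" profiled in this talk.
    std::vector<double> ttm_mode2_baseline(const std::vector<double>& X,
                                           const std::vector<double>& U, // F x J, column-major
                                           std::size_t I, std::size_t J,
                                           std::size_t K, std::size_t F) {
        // (1) Matricization: X2(j, i + I*k) = X(i, j, k); X2 is J x IK.
        std::vector<double> X2(J * I * K);
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t j = 0; j < J; ++j)
                for (std::size_t i = 0; i < I; ++i)
                    X2[j + J * (i + I * k)] = X[i + I * (j + J * k)];

        // (2) Multiplication: Y2 = U * X2 (F x IK), a plain Gemm.
        std::vector<double> Y2(F * I * K, 0.0);
        for (std::size_t c = 0; c < I * K; ++c)
            for (std::size_t j = 0; j < J; ++j)
                for (std::size_t f = 0; f < F; ++f)
                    Y2[f + F * c] += U[f + F * j] * X2[j + J * c];

        // (3) Tensorization: Y(i, f, k) = Y2(f, i + I*k); Y is I x F x K.
        std::vector<double> Y(I * F * K);
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t f = 0; f < F; ++f)
                for (std::size_t i = 0; i < I; ++i)
                    Y[i + I * (f + F * k)] = Y2[f + F * (i + I * k)];
        return Y;
    }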

SLIDE 11

Background

Main Contributions

  • Proposed an in-place tensor-times-matrix multiply (InTtm) algorithm that avoids physical reorganization of tensors.
  • Built an input-adaptive framework, InTensLi, to automatically adapt parameters and generate the code.
  • Achieved 4× and 13× speedups over the state-of-the-art Tensor Toolbox and Ctf tools.

SLIDE 12

Motivation

Observation 1: Transformation is expensive.

Notation: the number of words moved (Q), floating-point operations (W), and last-level cache size (Z). They are related by the communication lower bound²

    Q ≥ W / (8 √Z) − Z

for both general matrix-matrix multiply (Gemm) and Ttm.

Suppose Ttm does the same number of flops as Gemm (Ŵ = W). When counting the transformation, the arithmetic intensities (AI) of Gemm (A) and Ttm (Â) are related by

    Â ≈ A / (1 + A/m),

so (1 + A/m) is the penalty factor.

Assuming a cache size Z of 8 MB, the penalty for a 3-D tensor is 33. Conclusion: when Ttm and Gemm do the same number of flops, the arithmetic intensity of Ttm is reduced by a penalty factor of 33 or more as the tensor order increases.
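One way to recover the penalty formula, under our reading that m denotes the flops performed per word moved during the transformation:

    \hat{A} = \frac{W}{Q + Q_{\text{transform}}}
            = \frac{W}{W/A + W/m}
            = \frac{A}{1 + A/m},
    \qquad m \triangleq \frac{W}{Q_{\text{transform}}}.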

  • ² G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz. Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numerica, 23:1–155, 2014.
SLIDE 13

Motivation

Observation 2: Performance of the multiplication in Ttm is far below peak.

Ttm algorithm involves a variety of rectangular problem sizes.

(Figure a: Ttm's multiplication. U is m × k with m = F (tiny, e.g. 16) and X(n) is k × n with k = In and n = I1 ··· In−1 In+1 ··· IN, so the Gemm operands are short and fat.
Figure b: Gemm performance (GFLOP/s) of Intel MKL with 4 threads, over log2 k and log2 n ranging from 1 to 14.)

SLIDE 14

Motivation

Observation 3: Ttm organization is critical to data locality.

There are many ways to organize data accesses.

(Figure: multiplying U with the frontal slice X(:, :, k) gives contiguous access; multiplying with the lateral slice X(:, j, :) gives non-contiguous access.)

SLIDE 15

Motivation

Observation 3: Ttm organization is critical to data locality.

Among the many ways to organize data accesses, we choose the slice representation.

Table 1: Different representation forms of mode-1 Ttm on an I × J × K tensor.

  Representation form                       Mode-1 product                                            BLAS level   Transformation
  Tensor representation                     Y = X ×1 U                                                —            —
  Matrix representation (full reorg.)       Y(1) = U X(1)                                             L3           Yes
  Fiber representation (sub-tensor extr.)   y(f, :, k) = X(:, :, k) u(f, :), loops k = 1…K, f = 1…F   L2           No
  Slice representation (sub-tensor extr.)   Y(:, :, k) = U X(:, :, k), loops k = 1…K                  L3           No
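A minimal sketch of the slice form in the last row of Table 1, for a column-major I × J × K tensor using a generic CBLAS interface (each frontal slice is contiguous, so no transformation is needed):

    #include <cblas.h>   // e.g. from OpenBLAS or Intel MKL
    #include <cstddef>

    // Mode-1 Ttm in slice representation: Y(:,:,k) = U * X(:,:,k), k = 1..K.
    // Each slice is one BLAS level-3 call on the data in place.
    void ttm_mode1_slices(const double* X, const double* U, double* Y,
                          int I, int J, int K, int F) {
        for (int k = 0; k < K; ++k) {
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        /*M=*/F, /*N=*/J, /*K=*/I,
                        1.0, U, F,                            // U: F x I
                        X + (std::size_t)k * I * J, I,        // slice X(:,:,k)
                        0.0, Y + (std::size_t)k * F * J, F);  // slice Y(:,:,k)
        }
    }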

SLIDE 16

InTensLi Framework – Algorithmic Strategy

Layout

  1. Background
  2. Motivation
  3. InTensLi Framework (Algorithmic Strategy; InTensLi Framework)
  4. Experiments and Analysis
  5. Conclusion
  6. References

SLIDE 17

InTensLi Framework – Algorithmic Strategy

Algorithmic Strategy

(Figure: modes I1, I2, …, IN around mode n: the n−1 modes before mode n form the backward direction, the N−n modes after it the forward direction; e.g. grouping {I1 I2} and {I3 I4 I5 … IN} forms the sub-tensor Xsub.)

To avoid data copies, two rules: (1) compress only contiguous dimensions; (2) always include the leading dimension. Lemma: Ttm can be performed on up to max{n − 1, N − n} contiguous dimensions without physical reorganization.

SLIDE 18

InTensLi Framework – Algorithmic Strategy

Algorithmic Strategy

(Figure, rules, and lemma repeated from the previous slide.)

To get high Gemm performance: find an appropriate matrix size for the target architecture, using the auto-tuning method in the InTensLi framework.
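A minimal sketch of the resulting in-place kernel under our assumptions (column-major storage, all modes before mode n fused into L = I1···In−1, looping over R = In+1···IN); the paper's full algorithm additionally tunes how many modes go into the Gemm:

    #include <cblas.h>
    #include <cstddef>

    // In-place mode-n Ttm: each block X[., ., r] is a contiguous L x In
    // column-major matrix, so Ysub = Xsub * U^T needs no data copy.
    void inttm_mode_n(const double* X, const double* U, double* Y,
                      std::size_t L, std::size_t In, std::size_t R,
                      std::size_t F) {
        for (std::size_t r = 0; r < R; ++r) {
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                        (int)L, (int)F, (int)In,
                        1.0, X + r * L * In, (int)L,  // Xsub: L x In
                        U, (int)F,                    // U: F x In, used as U^T
                        0.0, Y + r * L * F, (int)L);  // Ysub: L x F
        }
    }

This matches the Ysub = Xsub U' form of the generated code shown later in the framework diagram.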

SLIDE 19

InTensLi Framework – Algorithmic Strategy

InTtm Algorithm and Comparison

InTtm's AI: Ã ≜ Ŵ / Q̂ ≈ 8 √Z ≈ A.

Traditional Ttm's AI: Â ≈ A / (1 + A/m).

InTtm thus eliminates the AI penalty factor (1 + A/m).

(Algorithm listing: the in-place tensor-times-matrix multiply (InTtm) algorithm.)

SLIDE 21

InTensLi Framework

Layout

  1. Background
  2. Motivation
  3. InTensLi Framework (Algorithmic Strategy; InTensLi Framework)
  4. Experiments and Analysis
  5. Conclusion
  6. References

SLIDE 22

InTensLi Framework

Input: tensor features, hardware configuration, and an MM benchmark.

Parameter estimation:
  • Mode partitioning: ML and MC.
  • Thread allocation: PL and PC.

Code generation.

(Figure: the InTensLi framework. Hardware parameters, the maximum number of threads, the MM benchmark, the input tensor, thresholds, mode n, and the data layout feed a parameter estimator, which performs mode partitioning (ML, MC) and thread allocation (PL, PC). A code generator then emits the InTtm code: nested parallel loops (parfor i1 = 1:I1, parfor i2 = 1:I2, …) around a matrix-matrix multiplication, Ysub = U Xsub or Ysub = Xsub U', calling MM libraries such as MKL or BLIS.)

SLIDE 23

InTensLi Framework

Parameter Estimation – Mode Partitioning

Decide the forward/backward strategy:

  • Row-major layout: forward strategy.
  • Column-major layout: backward strategy.

(Figure: a 6-mode tensor I1 … I6; the backward direction groups modes before mode n, the forward direction groups modes after it; e.g. grouping {I3 I4 I5 I6} forms the sub-tensor Xsub.)

SLIDE 24

InTensLi Framework

Parameter Estimation – Mode Partitioning

With the forward strategy chosen, the group size determines the InTtm algorithm.

(Figure: on a 6-mode tensor, group sizes 3 and 2 yield sub-tensors X1sub and X2sub with different splits of the modes into loop modes ML and Gemm modes MC.)

SLIDE 25

InTensLi Framework

Choosing Group Size

(Figure: Gemm performance (GFLOP/s) vs. log2 n for the MM benchmark multiplying 16 × 512 by 512 × n matrices with 4 threads, with an 80%-of-best band and the thresholds MSTH and MLTH marked; candidate sub-tensors X1sub, X2sub, X3sub fuse the trailing mode groups {I4 I5 I6}, {I5 I6}, and {I6}. The chosen partition is MC = {I5, I6}, ML = {I1, I2, I4}.)

MSTH and MLTH: thresholds on the Gemm size, measured as the combined size of all three operand matrices. MSTH = 1.04 MB and MLTH = 7.04 MB in our experiments.

SLIDE 26

InTensLi Framework

Choosing Group Size

(Figure and thresholds repeated from the previous slide.)

Decide MC: use MSTH and MLTH to choose the group size, which determines MC. Decide ML: the remaining modes outside MC, excluding mode n.
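A sketch of how the group size might be chosen from these thresholds; the selection loop is our guess at the talk's procedure, with only the threshold values taken from the slide:

    #include <cstddef>
    #include <vector>

    // Grow the Gemm-mode group MC with contiguous trailing modes while the
    // combined size of the three Gemm operands stays below ML_TH, stopping
    // once it reaches MS_TH (big enough for good Gemm performance).
    std::size_t choose_group_size(const std::vector<std::size_t>& dims,
                                  std::size_t n,  // product mode, 0-based
                                  std::size_t F,
                                  double ms_th = 1.04e6, double ml_th = 7.04e6) {
        std::size_t In = dims[n], group = 0, cols = 1;
        for (std::size_t d = dims.size(); d-- > n + 1; ) {
            double bytes = 8.0 * (F * In + (In + F) * cols); // U, Xsub, Ysub
            if (bytes >= ms_th) break;          // already large enough
            std::size_t next = cols * dims[d];
            if (8.0 * (F * In + (In + F) * next) > ml_th) break; // too large
            cols = next;
            ++group;
        }
        return group;   // number of trailing modes fused into MC
    }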

SLIDE 27

InTensLi Framework

Thread Allocation and Code Generation

Thread allocation

In most cases, maximum performance is obtained by one of only two configurations:

  • Small matrices: all threads are allocated to the nested loops.
  • Large matrices: all threads are allocated to the Gemm operation.

A threshold PTH (800 KB in our tests) distinguishes the two Gemm size regimes.

Code generation

Generate the nested loops and wrappers for the Gemm kernel. The code is generated in C++, using OpenMP with the collapse directive.
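A sketch of the shape of the generated code under our assumptions (two loop modes I1, I2 collapsed into one parallel loop, Gemm over the fused modes in the body); the identifiers are ours:

    #include <cblas.h>
    #include <cstddef>

    // Generated-code skeleton: OpenMP 'collapse' parallelizes the loop
    // modes ML; each iteration runs one Gemm over the fused modes MC.
    void generated_inttm(const double* X, const double* U, double* Y,
                         int I1, int I2, int L, int In, int F) {
        #pragma omp parallel for collapse(2)
        for (int i1 = 0; i1 < I1; ++i1)
            for (int i2 = 0; i2 < I2; ++i2) {
                const double* Xsub = X + ((std::size_t)i1 * I2 + i2) * L * In;
                double*       Ysub = Y + ((std::size_t)i1 * I2 + i2) * L * F;
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                            L, F, In, 1.0, Xsub, L, U, F, 0.0, Ysub, L);
            }
    }

Allocating all threads to the loops (small Gemm) corresponds to running each Gemm single-threaded inside this parallel region; for large Gemm, the outer pragma would be dropped and the BLAS library's own threading used instead.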

SLIDE 28

Experiments and Analysis

Experimental Platforms

All experiments use double precision. We employ 8 and 32 threads on the two platforms, respectively, to account for hyper-threading. The Xeon E7-4820 machine has a relatively large memory (512 GiB), allowing us to test a larger range of (dense) tensor sizes than has been common in prior single-node studies.

Table 2: Experimental platform configuration.

  Parameters             Intel Core i7-4770K   Intel Xeon E7-4820
  Microarchitecture      Haswell               Westmere
  Frequency              3.5 GHz               2.0 GHz
  # of physical cores    4                     16
  Hyper-threading        On                    On
  Peak GFLOP/s           224                   128
  Last-level cache       8 MiB                 18 MiB
  Memory size            32 GiB                512 GiB
  Memory bandwidth       25.6 GB/s             34.2 GB/s
  # of memory channels   2                     4
  Compiler               icc 15.0.2            icc 15.0.0

SLIDE 29

Experiments and Analysis

Performance Comparison

Implementations

  • InTtm: InTensLi-generated C++ code with OpenMP.
  • TT-TTM: Tensor Toolbox library in MATLAB.
  • Ctf: C++ code, supporting MPI+OpenMP parallelization.
  • Gemm: C++ code, the baseline Ttm algorithm without transformation.

Speedup

We obtain 4× and 13× speedups over Tensor Toolbox and Ctf, respectively, and get close to the Gemm-only performance.

(Figure: performance (GFLOP/s) of Gemm, Ctf, TT-TTM, and InTtm for mode-2 Ttm on tensors of sizes 1000³ (3-D), 180⁴ (4-D), and 60⁵ (5-D).)

SLIDE 30

Experiments and Analysis

Analysis

Performance of different modes.

InTensLi is stable for different mode-n products, while Tensor Toolbox is not.

(Figure: performance (GFLOP/s) of InTtm vs. Tensor Toolbox (TT-TTM) for the mode-1 through mode-4 products on a 160 × 160 × 160 × 160 tensor.)

SLIDE 31

Experiments and Analysis

Analysis

Parameter selection

Compared with exhaustive search, InTensLi's predicted configurations achieve close-to-optimal performance.

(Figure: Ttm performance (GFLOP/s) on mode-1 with the predicted configuration vs. the actual best configuration, on 5th-order tensors.)

SLIDE 32

Conclusion

Summary:

  • Proposed an in-place tensor-times-matrix multiply (InTtm) algorithm that avoids physical reorganization of tensors.
  • Built an input-adaptive framework, InTensLi, to automatically optimize and generate the code.
  • Achieved 4× and 13× speedups over the state-of-the-art Tensor Toolbox and Ctf tools.

Future work:

  • Integrate InTtm into Tucker and other tensor decompositions.
  • Explore a similar strategy for sparse tensors.

Source code: https://github.com/hpcgarage/InTensLi.

SLIDE 33

Conclusion

Backup Slides

SLIDE 34

References

  • E. Solomonik, D. Matthews, J. Hammond, and J. Demmel. Cyclops Tensor Framework: Reducing communication and eliminating load imbalance in massively parallel contractions. Technical Report UCB/EECS-2012-210, EECS Department, University of California, Berkeley, Nov 2012.
  • B. W. Bader, T. G. Kolda, et al. MATLAB Tensor Toolbox version 2.5. Available from http://www.sandia.gov/~tgkolda/TensorToolbox/index-2.6.html, January 2012.
  • T. Kolda and B. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
  • …

SLIDE 35


Observation 1: Transformation is expensive.

Transformation takes about 70% of the total run-time, and close to 50% of the total storage.

(Figure a: normalized time, split into Multiply and Transform. Figure b: normalized space, likewise. Both for tensor sizes 1000³, 200⁴, and 60⁵.)

Profiling of the Ttm algorithm for the mode-2 product on 3rd-, 4th-, and 5th-order tensors, where the output tensors are low-rank representations of the corresponding input tensors.
