
SLIDE 1

A Sparse Tensor Format and a Benchmark Suite

Jiajia Li

Pacific Northwest National Laboratory

January 25, 2019 @ MIT

Figure sources: “A brief survey of tensors” by Berton Earnshaw and NVIDIA Tensor Cores

SLIDE 2

HiCOO: Hierarchical Storage of Sparse Tensors

Jiajia Li 1,2, Jimeng Sun 1, Richard Vuduc 1

1 Georgia Institute of Technology 2 Pacific Northwest National Laboratory


SUNLAB

Code: https://github.com/hpcgarage/ParTI (v1.0.0)

SLIDE 3

Challenges

Compactness: A space-efficient data structure
Mode-Genericity: Efficient traversals of the data structure for computations

The concept “mode-genericity” is inherited from [Baskaran et al. 2012].
[Baskaran et al. 2012] M. Baskaran et al., “Efficient and scalable computations with sparse tensors,” HPEC 2012.

SLIDE 4

Baseline Sparse Tensor Formats in This Work

Example tensor: i = 1,…,I; j = 1,…,J; k = 1,…,K, with I = 4, J = 4, K = 3.

(a) COO

i j k val
0 0 0 1
0 1 0 2
1 0 0 3
1 0 2 4
2 1 0 5
2 2 2 6
3 0 1 7
3 3 2 8

(b) CSF: the same nonzeros stored as a tree with levels i, j, k and the values at the leaves.

(c) F-COO: columns bf, j, k, val, with start flags sf[0]=1, sf[1]=1.

Mode-generic: COO
Mode-specific: CSF and F-COO, which prefer different representations for different modes.

COO: coordinate format [Bader et al., 2006]
CSF: Compressed Sparse Fibers, an extension of CSR [Smith et al. 2015]
F-COO: Flagged COO format [Liu et al., 2017]

SLIDE 5

Mode-Specific Tensor Formats

Three CSF/F-COO representations are required/preferred for three kernels.

CSF-1 (levels i, j, k), CSF-2 (levels j, i, k), CSF-3 (levels k, i, j)

Tensor decomposition: CSF-1 for the kernel in mode 1, CSF-2 for the kernel in mode 2, CSF-3 for the kernel in mode 3
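The three mode orderings can be illustrated with a minimal sketch. It uses the example tensor's eight nonzeros (zero-based indices) and a nested dictionary as a stand-in for a CSF tree; real CSF stores pointer and index arrays per level, and the name `build_fiber_tree` is hypothetical rather than taken from ParTI.

```python
# Nonzeros of the 4 x 4 x 3 example tensor as (i, j, k, val), zero-based.
COO = [
    (0, 0, 0, 1), (0, 1, 0, 2), (1, 0, 0, 3), (1, 0, 2, 4),
    (2, 1, 0, 5), (2, 2, 2, 6), (3, 0, 1, 7), (3, 3, 2, 8),
]


def build_fiber_tree(coo, mode_order):
    """Nest the nonzeros level by level following mode_order.

    mode_order = (0, 1, 2) gives levels i, j, k; values sit at the leaves.
    A nested dict is only an illustration of the tree shape, not real CSF.
    """
    tree = {}
    for *idx, val in sorted(coo, key=lambda nz: [nz[m] for m in mode_order]):
        node = tree
        for m in mode_order[:-1]:
            node = node.setdefault(idx[m], {})
        node[idx[mode_order[-1]]] = val
    return tree


csf1 = build_fiber_tree(COO, (0, 1, 2))  # levels i, j, k
csf2 = build_fiber_tree(COO, (1, 0, 2))  # levels j, i, k
csf3 = build_fiber_tree(COO, (2, 0, 1))  # levels k, i, j
```

Each tree holds the same eight values but groups them under a different root mode, which is why a mode-specific format ends up keeping one copy per kernel.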

SLIDE 6

Mode-Specific Tensor Formats

Three CSF/F-COO representations are required/preferred for three kernels.

Using only CSF-1 for the tensor decomposition kernels in mode 1, mode 2, and mode 3: performance drops.

SLIDE 7

Mode Orientation

Tensor decomposition: kernel in mode 1, kernel in mode 2, kernel in mode 3

Mode-specific: mode-1 oriented (CSF/F-COO)
Mode-generic: coordinate (COO), HiCOO

Legend: efficient / inefficient

SLIDE 8

HiCOO Format

Store a sparse tensor in units of small sparse blocks

i = 1,…,I j = 1,…,J k = 1,…,K

Block size: 2×2×2

COO

i j k val
0 0 0 1
0 1 0 2
1 0 0 3
1 0 2 4
2 1 0 5
2 2 2 6
3 0 1 7
3 3 2 8

HiCOO, blocks

block bptr bi bj bk
B1 0 0 0 0
B2 3 0 0 1
B3 4 1 0 0
B4 6 1 1 1

HiCOO, elements

ei ej ek val
0 0 0 1
0 1 0 2
1 0 0 3
1 0 0 4
0 1 0 5
1 0 1 7
0 0 0 6
1 1 0 8

An extension of the Compressed Sparse Blocks (CSB) format [Buluç et al., SPAA 2009].
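The blocking step can be sketched as follows, assuming a block size of B = 2 per mode (the 2×2×2 blocks of the example). The function name `coo_to_hicoo` is hypothetical, and blocks are ordered lexicographically here because that reproduces the example; the ordering used in ParTI may differ.

```python
# Nonzeros of the example tensor as (i, j, k, val), zero-based.
COO = [
    (0, 0, 0, 1), (0, 1, 0, 2), (1, 0, 0, 3), (1, 0, 2, 4),
    (2, 1, 0, 5), (2, 2, 2, 6), (3, 0, 1, 7), (3, 3, 2, 8),
]


def coo_to_hicoo(coo, block_size):
    """Group nonzeros into blocks and split each index into block and element parts."""
    blocks = {}
    for i, j, k, val in coo:
        block_key = (i // block_size, j // block_size, k // block_size)
        element = (i % block_size, j % block_size, k % block_size)
        blocks.setdefault(block_key, []).append((element, val))

    bptr, block_indices, element_indices, values = [], [], [], []
    for block_key in sorted(blocks):  # lexicographic order: an assumption
        bptr.append(len(values))      # position of the block's first nonzero
        block_indices.append(block_key)
        for element, val in blocks[block_key]:
            element_indices.append(element)
            values.append(val)
    return bptr, block_indices, element_indices, values


bptr, block_indices, element_indices, values = coo_to_hicoo(COO, block_size=2)
```

Block indices are stored once per block and only the small element offsets are stored per nonzero, which is where the compression comes from.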

SLIDE 9

HiCOO Format

i = bi * B + ei, with a 32-bit index i, a 32-bit block index bi, an 8-bit element index ei, and block size B

Store a sparse tensor in units of small sparse blocks

Shorten the bit-length of element indices
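As a small illustration of i = bi * B + ei, the sketch below rebuilds full coordinates from the block and element indices of the example (block size B = 2). The arrays are copied from the example figure with zero-based indices; the function name `hicoo_to_coo` is hypothetical.

```python
B = 2  # block size per mode in the example
bptr = [0, 3, 4, 6]
block_indices = [(0, 0, 0), (0, 0, 1), (1, 0, 0), (1, 1, 1)]
element_indices = [
    (0, 0, 0), (0, 1, 0), (1, 0, 0),  # block B1
    (1, 0, 0),                        # block B2
    (0, 1, 0), (1, 0, 1),             # block B3
    (0, 0, 0), (1, 1, 0),             # block B4
]
values = [1, 2, 3, 4, 5, 7, 6, 8]


def hicoo_to_coo(bptr, block_indices, element_indices, values, block_size):
    """Recover (i, j, k, val) tuples using i = bi * B + ei for every mode."""
    coo = []
    ends = bptr[1:] + [len(values)]  # each block ends where the next one starts
    for (bi, bj, bk), start, end in zip(block_indices, bptr, ends):
        for (ei, ej, ek), val in zip(element_indices[start:end], values[start:end]):
            coo.append((bi * block_size + ei,
                        bj * block_size + ej,
                        bk * block_size + ek,
                        val))
    return coo


coo = hicoo_to_coo(bptr, block_indices, element_indices, values, B)
```

Element indices never exceed B - 1, so any block size up to 256 keeps them within 8 bits.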


SLIDE 10

HiCOO Format


Store a sparse tensor in units of small sparse blocks

Shorten the bit-length of element indices
Compress the number of block indices

Block indices (bi, bj, bk) and block pointers (bptr): 32-bit; element indices (ei, ej, ek): 8-bit

SLIDE 11

HiCOO Format

i = bi * B + ei (i: 32-bit; bi: 32-bit block index; ei: 8-bit element index)
COO indices: nnz * 3 * 32 bits
HiCOO indices: nnz * 3 * 8 + nnb * (3 * 32 + 32) bits

Store a sparse tensor in units of small sparse blocks

Shorten the bit-length of element indices
Compress the number of block indices

nnz: number of nonzeros; nnb: number of nonzero blocks
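To make the two index-storage formulas concrete, the sketch below evaluates them for the example tensor (nnz = 8, nnb = 4). The function and parameter names are hypothetical; the bit widths are the ones stated on the slide.

```python
def coo_index_bits(nnz, order=3, index_bits=32):
    """COO stores one full-width index per mode for every nonzero."""
    return nnz * order * index_bits


def hicoo_index_bits(nnz, nnb, order=3, element_bits=8, block_bits=32, bptr_bits=32):
    """HiCOO stores short element indices per nonzero, plus per-block indices and a pointer."""
    return nnz * order * element_bits + nnb * (order * block_bits + bptr_bits)


nnz, nnb = 8, 4                          # the example tensor
coo_bits = coo_index_bits(nnz)           # 8 * 3 * 32 = 768
hicoo_bits = hicoo_index_bits(nnz, nnb)  # 8 * 3 * 8 + 4 * (3 * 32 + 32) = 704
```

With only two nonzeros per block on average the saving is small (768 versus 704 bits); the per-block cost of 128 bits is amortized better as blocks become denser.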

SLIDE 12

HiCOO Format

For the tensor: Reduce its storage and memory footprint
For matrices: Better data locality

Store a sparse tensor in units of small sparse blocks

Shorten the bit-length of element indices
Compress the number of block indices
For arbitrary-order sparse tensors.


SLIDE 13

Platform and Dataset

Platform: Intel Xeon CPU E7-4850 v3 platform with 56 physical cores; code compiled with icc 18.0.2 and parallelized with OpenMP.
Dataset: FROSTT [Smith et al. 2017], HaTen2 [Jeon et al. 2015], and healthcare data [Perros et al. 2017].

SLIDE 14

Multicore CP-ALS

HiCOO outperforms COO by 6.2× and CSF by 2.1× on average.

[Figure: compression ratio relative to CSF (higher is better) and speedup over CSF (higher is better) for COO, CSF-1, and HiCOO. 3D tensors: choa, darpa, deli, fb-m, fb-s, nell1, nell2. 4D tensors: crime, deli4d, enron, flickr, nips.]

SLIDE 15

Following Work

HiCOO for other tensor operations and Tucker decomposition
HiCOO-MTTKRP/CPD on GPUs and distributed systems

SLIDE 16

PASTA: A Parallel Sparse Tensor Algorithm Benchmark Suite

Jiajia Li 1, Yuchen Ma 2, Xiaolong Wu 3, Ang Li 1, Kevin Barker 1

1 Pacific Northwest National Laboratory 2 Hangzhou Dianzi University 3 Virginia Tech

Code: https://gitlab.com/tensorworld/pasta

SLIDE 17

PASTA Workloads

Workloads:
TEW (Element-Wise)
TS (Tensor-Scalar)
TTV (Tensor-Times-Vector)
TTM (Tensor-Times-Matrix)
MTTKRP (Matricized Tensor-Times-Khatri-Rao Product)

Data structures/algorithms: COO
Platforms: single-core CPUs, multi-core CPUs
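Two of the listed workloads can be sketched on a COO tensor as follows. This is a sequential illustration under assumptions, not PASTA's implementation: TS is taken to be multiplication of every nonzero by a scalar, TTV contracts one mode with a dense vector, and the names `ts_multiply` and `ttv` are hypothetical. The data is the eight-nonzero example tensor with zero-based indices.

```python
# Nonzeros of the example tensor as (i, j, k, val), zero-based.
COO = [
    (0, 0, 0, 1), (0, 1, 0, 2), (1, 0, 0, 3), (1, 0, 2, 4),
    (2, 1, 0, 5), (2, 2, 2, 6), (3, 0, 1, 7), (3, 3, 2, 8),
]


def ts_multiply(coo, scalar):
    """TS: scale every nonzero; the sparsity pattern is unchanged."""
    return [(i, j, k, val * scalar) for i, j, k, val in coo]


def ttv(coo, vec, mode):
    """TTV: contract the given mode with a dense vector.

    The result has one mode fewer; it is kept as a dict keyed by the
    remaining indices, since several nonzeros can land on the same entry.
    """
    out = {}
    for *idx, val in coo:
        rest = tuple(x for m, x in enumerate(idx) if m != mode)
        out[rest] = out.get(rest, 0) + val * vec[idx[mode]]
    return out


scaled = ts_multiply(COO, 2)
y = ttv(COO, vec=[1, 10, 100], mode=2)  # contract mode k
```

In `ttv`, all nonzeros that share the remaining indices (a fiber along the contracted mode) accumulate into one output entry, for example the two nonzeros at (1, 0, 0) and (1, 0, 2).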

SLIDE 18

PASTA Workloads


Arbitrary shape and nonuniform nonzero pattern

SLIDE 19

PASTA Workloads

Parallelization strategies:
Parallelize nonzeros
Parallelize nonzero fibers
Parallelize nonzeros with atomics
Parallelize nonzero partitions
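Why some workloads need atomics can be seen in a sequential sketch of mode-1 MTTKRP on COO. It assumes the usual definition, M[i][r] += val * B[j][r] * C[k][r] for each nonzero (i, j, k, val) with dense factor matrices B and C; the name `mttkrp_mode1` is hypothetical and this is not PASTA's code.

```python
# Nonzeros of the example tensor as (i, j, k, val), zero-based.
COO = [
    (0, 0, 0, 1), (0, 1, 0, 2), (1, 0, 0, 3), (1, 0, 2, 4),
    (2, 1, 0, 5), (2, 2, 2, 6), (3, 0, 1, 7), (3, 3, 2, 8),
]


def mttkrp_mode1(coo, mat_b, mat_c, num_rows, rank):
    """Sequential mode-1 MTTKRP: accumulate each nonzero into output row i."""
    out = [[0.0] * rank for _ in range(num_rows)]
    for i, j, k, val in coo:  # one iteration per nonzero
        for r in range(rank):
            # Different nonzeros can share the same i, so a parallel loop
            # over nonzeros would need an atomic add (or a partitioning
            # that keeps each output row on one thread) at this update.
            out[i][r] += val * mat_b[j][r] * mat_c[k][r]
    return out


RANK = 2
mat_b = [[1.0] * RANK for _ in range(4)]  # J x R factor matrix, all ones
mat_c = [[1.0] * RANK for _ in range(3)]  # K x R factor matrix, all ones
result = mttkrp_mode1(COO, mat_b, mat_c, num_rows=4, rank=RANK)
```

With all-ones factor matrices, each output row reduces to the sum of the nonzero values in that slice, which makes the write conflicts on shared rows easy to see.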

SLIDE 20

Memory-Bound Workloads


SLIDE 21

Following Work

Include HiCOO, CSF and other formats
Support GPUs, FPGAs (long-term future)

SLIDE 22

Other Recent Work


A dynamic sparse tensor structure for tensor contraction

Collaborators: Sriram Krishnamoorthy (PNNL)
Application: Quantum Chemistry, NWChemEx

Hybrid formats and nonzero partitioning strategies

Collaborators: Israt Nisa (OSU), P. (Saday) Sadayappan (OSU), Sriram Krishnamoorthy (PNNL)

SLIDE 23

Acknowledgement
