A Sparse Tensor Format and a Benchmark Suite


SLIDE 1

A Sparse Tensor Format and a Benchmark Suite

Jiajia Li

Pacific Northwest National Laboratory

January 25, 2019 @ MIT

Figure sources: “A brief survey of tensors” by Berton Earnshaw and NVIDIA Tensor Cores

SLIDE 2

HiCOO: Hierarchical Storage of Sparse Tensors

Jiajia Li 1,2, Jimeng Sun 1, Richard Vuduc 1

1 Georgia Institute of Technology 2 Pacific Northwest National Laboratory

SUNLAB

Code: https://github.com/hpcgarage/ParTI (v1.0.0)

SLIDE 3

Challenges

  • Compactness: a space-efficient data structure
  • Mode-genericity: efficient traversals of the data structure for computations

The concept of “mode-genericity” is inherited from [Baskaran et al. 2012].

[Baskaran et al. 2012] M. Baskaran et al., “Efficient and scalable computations with sparse tensors,” HPEC 2012.

SLIDE 4

Baseline Sparse Tensor Formats in This Work

[Figure: one 4 × 4 × 3 example tensor (i = 1,…,I; j = 1,…,J; k = 1,…,K) stored as (a) COO with columns i, j, k, val; (b) CSF with levels i, j, k, val; (c) F-COO with columns bf, j, k, val and start flags sf[0]=1, sf[1]=1.]

  • Mode-generic: COO
  • Mode-specific: CSF and F-COO, which prefer different representations for different modes

  • COO: coordinate format [Bader et al., 2006]
  • CSF: Compressed Sparse Fibers, an extension of CSR [Smith et al., 2015]
  • F-COO: Flagged COO format [Liu et al., 2017]

SLIDE 5

Mode-Specific Tensor Formats

Three CSF/F-COO representations are required/preferred for three kernels.

[Figure: the example tensor stored three times, as CSF-1 (level order i, j, k), CSF-2 (level order j, i, k), and CSF-3 (level order k, i, j). In tensor decomposition, CSF-1 serves the kernel in mode 1, CSF-2 the kernel in mode 2, and CSF-3 the kernel in mode 3.]

SLIDE 6

Mode-Specific Tensor Formats

Three CSF/F-COO representations are required/preferred for three kernels.

[Figure: only CSF-1 (level order i, j, k) is stored and is used for the tensor decomposition kernels in mode 1, mode 2, and mode 3.]

Performance drops.

SLIDE 7

Mode Orientation

[Figure: efficiency (efficient vs. inefficient) of the tensor decomposition kernels in mode 1, mode 2, and mode 3 for three formats: mode-1 oriented CSF/F-COO (mode-specific), coordinate COO (mode-generic), and HiCOO.]

SLIDE 8

HiCOO Format

Store a sparse tensor in units of small sparse blocks

i = 1,…,I; j = 1,…,J; k = 1,…,K

COO

i  j  k  val
0  0  0  1
0  1  0  2
1  0  0  3
1  0  2  4
2  1  0  5
2  2  2  6
3  0  1  7
3  3  2  8

HiCOO (block size: 2*2*2)

block  bptr  bi bj bk  ei ej ek  val
B1     0     0  0  0   0  0  0   1
                       0  1  0   2
                       1  0  0   3
B2     3     0  0  1   1  0  0   4
B3     4     1  0  0   0  1  0   5
                       1  0  1   7
B4     6     1  1  1   0  0  0   6
                       1  1  0   8

An extension of the Compressed Sparse Blocks (CSB) format [Buluç et al., SPAA 2009].
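The blocking step can be illustrated with a minimal Python sketch that rebuilds the HiCOO arrays of the example tensor from its COO form. The function name `coo_to_hicoo`, the dictionary layout, and the end sentinel appended to `bptr` are assumptions of this sketch, not ParTI's API. Blocks are sorted lexicographically here for simplicity; the actual implementation may order blocks differently.

```python
# COO coordinates (i, j, k) and values of the example tensor, 0-based.
coo_coords = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 0, 2),
              (2, 1, 0), (2, 2, 2), (3, 0, 1), (3, 3, 2)]
coo_vals = [1, 2, 3, 4, 5, 6, 7, 8]


def coo_to_hicoo(coords, vals, block_size):
    """Group COO nonzeros into blocks of block_size along every mode."""
    # Split each coordinate into (block index, element offset inside the block).
    entries = []
    for coord, val in zip(coords, vals):
        block = tuple(c // block_size for c in coord)
        elem = tuple(c % block_size for c in coord)
        entries.append((block, elem, val))
    # Stable sort by block index keeps the original order inside each block.
    entries.sort(key=lambda entry: entry[0])

    bptr, binds, einds, out_vals = [], [], [], []
    for pos, (block, elem, val) in enumerate(entries):
        if not binds or binds[-1] != block:
            bptr.append(pos)      # position of the first nonzero of a new block
            binds.append(block)   # (bi, bj, bk), stored once per block
        einds.append(elem)        # (ei, ej, ek), stored once per nonzero
        out_vals.append(val)
    bptr.append(len(out_vals))    # end sentinel (an assumption of this sketch)
    return {"bptr": bptr, "binds": binds, "einds": einds, "vals": out_vals}


hicoo = coo_to_hicoo(coo_coords, coo_vals, block_size=2)
```

For the example tensor this yields four blocks starting at positions 0, 3, 4, and 6, with the values reordered to 1, 2, 3, 4, 5, 7, 6, 8 as in the HiCOO table.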

SLIDE 9

HiCOO Format

i = bi * B + ei, where i is a 32-bit COO index, bi is a 32-bit block index, ei is an 8-bit element index, and B is the block size.

Store a sparse tensor in units of small sparse blocks

  • Shorten the bit-length of element indices

[Figure: the HiCOO and COO tables of the example tensor from slide 8 (i = 1,…,I; j = 1,…,J; k = 1,…,K). Block size: 2*2*2.]
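The index relation i = bi * B + ei can be checked with a few lines of Python. This is only a sketch: the helper names `split_index` and `join_index` are hypothetical, and B = 2 is taken from the 2*2*2 example.

```python
BLOCK_SIZE = 2     # B in the slide's formula; 2 matches the 2*2*2 example
ELEMENT_BITS = 8   # bit-length of an element index, as stated on the slide


def split_index(i, block_size=BLOCK_SIZE):
    """Return (bi, ei) such that i == bi * block_size + ei."""
    return i // block_size, i % block_size


def join_index(bi, ei, block_size=BLOCK_SIZE):
    """Rebuild the full index from its block and element parts."""
    return bi * block_size + ei


# Round trip over the i-range of the example tensor (i = 0..3).
pairs = [split_index(i) for i in range(4)]
restored = [join_index(bi, ei) for bi, ei in pairs]

# Largest block size whose offsets 0..B-1 still fit in an 8-bit element index.
max_block_size = 2 ** ELEMENT_BITS
```

Because the element offset must fit in 8 bits, the block size per mode cannot exceed 256 under these bit-lengths.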

SLIDE 10

HiCOO Format

Store a sparse tensor in units of small sparse blocks

  • Shorten the bit-length of element indices
  • Compress the number of block indices

[Figure: the HiCOO and COO tables of the example tensor from slide 8, labeled with 32-bit block indices and 8-bit element indices for HiCOO and 32-bit indices for COO.]

SLIDE 11

HiCOO Format

i = bi * B + ei, where i is a 32-bit COO index, bi is a 32-bit block index, and ei is an 8-bit element index.

COO indices = nnz * 3 * 32 bits
HiCOO indices = nnz * 3 * 8 + nnb * (3 * 32 + 32) bits

Store a sparse tensor in units of small sparse blocks

  • Shorten the bit-length of element indices
  • Compress the number of block indices

[Figure: the HiCOO and COO tables of the example tensor from slide 8.]

nnz: #Nonzeros; nnb: #Non-zero blocks
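As a worked instance of the two index-storage formulas, the following sketch plugs in the example tensor (nnz = 8, nnb = 4). The function and parameter names are illustrative only; the bit-lengths are those given on the slide.

```python
def coo_index_bits(nnz, order=3, index_bits=32):
    """Index storage of COO: one 32-bit index per mode per nonzero."""
    return nnz * order * index_bits


def hicoo_index_bits(nnz, nnb, order=3, elem_bits=8, block_bits=32, bptr_bits=32):
    """Index storage of HiCOO: 8-bit element indices per nonzero,
    plus 32-bit block indices and one 32-bit bptr entry per nonzero block."""
    return nnz * order * elem_bits + nnb * (order * block_bits + bptr_bits)


nnz, nnb = 8, 4                            # example tensor: 8 nonzeros in 4 blocks
coo_bits = coo_index_bits(nnz)             # 8 * 3 * 32
hicoo_bits = hicoo_index_bits(nnz, nnb)    # 8 * 3 * 8 + 4 * (3 * 32 + 32)
```

Here COO needs 768 bits and HiCOO 704 bits. Subtracting the two formulas gives 72 * nnz - 128 * nnb, so for third-order tensors HiCOO index storage is smaller whenever the average number of nonzeros per block exceeds 128/72 ≈ 1.78.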

SLIDE 12

HiCOO Format

For the tensor: reduce its storage and memory footprint.
For matrices: better data locality.

Store a sparse tensor in units of small sparse blocks

  • Shorten the bit-length of element indices
  • Compress the number of block indices
  • For arbitrary-order sparse tensors.

[Figure: the HiCOO and COO tables of the example tensor from slide 8.]

SLIDE 13

Platform and Dataset

Platform: an Intel Xeon CPU E7-4850 v3 platform with 56 physical cores; code compiled with icc 18.0.2 and parallelized with OpenMP.

Datasets: FROSTT [Smith et al. 2017], HaTen2 [Jeon et al. 2015], and healthcare data [Perros et al. 2017].

SLIDE 14

Multicore CP-ALS

HiCOO outperforms COO by 6.2× and CSF by 2.1× on average.

[Figure: bar charts comparing COO, CSF-1, and HiCOO on 3D tensors (choa, darpa, deli, fb-m, fb-s, nell1, nell2) and 4D tensors (crime, deli4d, enron, flickr, nips). Panels: compression ratio relative to CSF (higher is better) and speedup over CSF (higher is better).]

SLIDE 15

Following Work

  • HiCOO for other tensor operations and Tucker decomposition
  • HiCOO-MTTKRP/CPD on GPUs and distributed systems

SLIDE 16

PASTA: A Parallel Sparse Tensor Algorithm Benchmark Suite

Jiajia Li 1, Yuchen Ma 2, Xiaolong Wu 3, Ang Li 1, Kevin Barker 1

1 Pacific Northwest National Laboratory, 2 Hangzhou Dianzi University, 3 Virginia Tech

Code: https://gitlab.com/tensorworld/pasta

SLIDE 17

PASTA Workloads

  • Workloads: TEW (Element-Wise), TS (Tensor-Scalar), TTV (Tensor-Times-Vector), TTM (Tensor-Times-Matrix), MTTKRP (Matricized Tensor-Times-Khatri-Rao Product)
  • Data structures/algorithms: COO
  • Platforms: single-core CPUs, multi-core CPUs

SLIDE 18

PASTA Workloads

[Same overview of workloads (TEW, TS, TTV, TTM, MTTKRP), data structure (COO), and platforms (single-core and multi-core CPUs) as on slide 17.]

Arbitrary shape and nonuniform nonzero pattern

SLIDE 19

PASTA Workloads

[Same overview of workloads (TEW, TS, TTV, TTM, MTTKRP), data structure (COO), and platforms (single-core and multi-core CPUs) as on slide 17.]

  • Parallelize nonzeros
  • Parallelize nonzero fibers
  • Parallelize nonzeros with atomics
  • Parallelize nonzero partitions
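To make the "nonzeros with atomics" strategy concrete, here is a serial Python sketch of a mode-1 MTTKRP over COO nonzeros for a third-order tensor. Pairing this strategy with MTTKRP is an assumption of the sketch, since the slide text does not say which strategy belongs to which workload. The function name `mttkrp_mode1_coo` and the factor matrices `B` and `C` are hypothetical, and the code is not taken from PASTA.

```python
def mttkrp_mode1_coo(coords, vals, B, C, num_rows):
    """Compute M[i][r] = sum over nonzeros (i, j, k) of val * B[j][r] * C[k][r]."""
    rank = len(B[0])
    M = [[0.0] * rank for _ in range(num_rows)]
    # One loop iteration per nonzero: this is the loop a parallel version would split.
    for (i, j, k), val in zip(coords, vals):
        for r in range(rank):
            # Several nonzeros can share the same row i, so a nonzero-parallel
            # version would need an atomic add for this update.
            M[i][r] += val * B[j][r] * C[k][r]
    return M


# Example tensor from the HiCOO slides (4 x 4 x 3, 0-based COO).
coords = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 0, 2),
          (2, 1, 0), (2, 2, 2), (3, 0, 1), (3, 3, 2)]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
B = [[1.0, 2.0] for _ in range(4)]   # hypothetical J x R factor matrix, R = 2
C = [[1.0, 1.0] for _ in range(3)]   # hypothetical K x R factor matrix

M = mttkrp_mode1_coo(coords, vals, B, C, num_rows=4)
```

Each output row is touched by two nonzeros in this example, which is exactly the write conflict that atomics resolve when the nonzero loop runs in parallel.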

SLIDE 20

Memory-Bound Workloads

SLIDE 21

Following Work

  • Include HiCOO, CSF, and other formats
  • Support GPUs, FPGAs (long-term future)

SLIDE 22

Other Recent Work

A dynamic sparse tensor structure for tensor contraction

  • Collaborators: Sriram Krishnamoorthy (PNNL)
  • Application: Quantum Chemistry, NWChemEx

Hybrid formats and nonzero partitioning strategies

  • Collaborators: Israt Nisa (OSU), P. (Saday) Sadayappan (OSU), Sriram Krishnamoorthy (PNNL)

SLIDE 23

Acknowledgement