SLIDE 1

Data-Layout Optimization based on Memory-Access-Pattern Analysis for Source-Code Performance Improvement

Authors: Riyane SID LAKHDAR, Henri-Pierre CHARLES, Maha KOOLI
Univ Grenoble Alpes, CEA, List, F-38000 Grenoble, France

Sankt Goar, Germany | May 26th 2020
23rd International Workshop on Software and Compilers for Embedded Systems (SCOPES '20)

SLIDE 3

CONTEXT AND MOTIVATIONS

  • Scientific applications span different hardware (HW) technologies
  • Keeping applications adapted to the HW takes significant time and engineering effort

Riyane SID LAKHDAR et al. / CEA / SCOPES’20

SLIDE 5

PROBLEM: DATA LAYOUT FOR HW/SW PERFORMANCE

SLIDE 7

OBJECTIVE AND METHOD

Problem:

  • Several possible implementations exist for the matrix data-layout
  • Overall performance is deeply impacted by this choice [SidLakhdar_2019]

Objective:

Automatically detect the most efficient data-layout implementation:

  • For each variable
  • With regard to the host hardware (memory)

Method:

Map the detected memory-access pattern to a known optimized implementation

[SidLakhdar_2019] Sid Lakhdar Riyane et al. “Toward Modeling of Cache-Miss Ratio for Dense-Data-Access-Based Optimization”. In RSP 2019. ACM.
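
To illustrate why the matrix data-layout matters, here is a minimal sketch comparing two common layouts (row-major and column-major) for the same column traversal. The function names, the 4-byte element size, the 64-byte cache line and the zero-aligned base address are illustrative assumptions, not values taken from the talk.

```python
# Hypothetical helpers: two classic data-layout implementations of a 2D matrix.
def row_major(x, y, n_cols):
    """Linear element offset of (row x, column y) when rows are contiguous."""
    return x * n_cols + y


def col_major(x, y, n_rows):
    """Linear element offset of (row x, column y) when columns are contiguous."""
    return y * n_rows + x


def lines_touched(offsets, elem_size=4, line_size=64):
    """Count the distinct cache lines covered by a sequence of element offsets.

    Assumes the matrix starts at a cache-line-aligned address.
    """
    return len({(off * elem_size) // line_size for off in offsets})


N = 16
column_walk = [(x, 0) for x in range(N)]  # read column 0 from top to bottom

rm = lines_touched(row_major(x, y, N) for x, y in column_walk)
cm = lines_touched(col_major(x, y, N) for x, y in column_walk)
print(rm, cm)  # prints "16 1"
```

Under these assumptions the same access pattern touches 16 cache lines with one layout and a single line with the other, which is the kind of gap an automatic layout decision aims to close.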

SLIDE 8

OUTLINE

  • State of the art: Pattern detection, usage and DLD
  • HARDSI: Hardware Adapted Restructuring of Data Structure Implementation
  • Experimental Results
  • Enhancing HARDSI with Data-cache modeling

SLIDE 9

STATE OF THE ART: PATTERN DETECTION

What is a memory-access pattern?

“smallest set of consecutive accesses (read and write) to a given data structure that can be repeated in order to represent the total accesses to the data structure.” [Xu_2019]

The accesses are either:

  a) Addresses (virtual/physical)
  b) Indexes (e.g. array)
  c) Transformation of a) or b)

[Xu_2019] Xu Zhixing, Ray Sayak, Subramanyan Pramod and Malik Sharad. «Malware detection using machine learning based analysis of virtual memory access patterns». In Proceedings of the Conference on Design, Automation & Test in Europe.
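
The definition above can be sketched as a search for the shortest prefix that, repeated, reproduces the whole access sequence. This is only an illustration: the function name and the (access type, variable) tuples are hypothetical, and the sketch assumes the trace contains a whole number of repetitions.

```python
def smallest_pattern(accesses):
    """Return the shortest prefix whose repetition rebuilds the full access list.

    Simplifying assumption: the trace holds a whole number of repetitions,
    so only prefix lengths that divide len(accesses) are tried.
    """
    n = len(accesses)
    for period in range(1, n + 1):
        if n % period == 0 and accesses == accesses[:period] * (n // period):
            return accesses[:period]
    return accesses  # only reached for an empty trace


# Hypothetical trace: each access is (access type, variable name).
unit = [("READ", "a"), ("READ", "b"), ("UPDATE", "res")]
trace = unit * 4
pattern = smallest_pattern(trace)
print(len(trace), len(pattern))  # prints "12 3"
```

A trace with no repetition is its own pattern, so the function then returns the input unchanged.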

SLIDE 11

STATE OF THE ART: PATTERN DETECTION

Detection of memory-access pattern:

  • Intensively used by memory pre-fetchers
  • Used to predict the next addresses to be accessed [Wilkerson_2019]

Examples:

  • Toddler [Nistor_13]
  • QUAD [Ostadzadeh_15]
  • Aristotle [Fang_17]

Problem:

  • Granularity ~ Bytes
  • Does not scale to a data structure

[Wilkerson_2019] Christopher B Wilkerson et al. 2019. Instruction and logic for software hints to improve hardware prefetcher effectiveness. US Patent 10,229,060.
[Nistor_13] Nistor Adrian, et al. «Toddler: Detecting performance problems via similar memory-access patterns». In Proceedings of the ICSE’13, IEEE Press.
[Ostadzadeh_15] Ostadzadeh S Arash, et al. «Quad: a memory access pattern analyser». In ISARC.
[Fang_17] Fang Jianbin, et al. «Aristotle: A performance impact indicator for the OpenCL kernels using local memory». In the Scientific Programming journal.

SLIDE 13

STATE OF THE ART: PATTERN DETECTION

Profiling of memory-access pattern:

  • Mainly used in the detection of malware or fault injection
  • Example: [Xu_2019]

Problem:

  • Granularity: virtual pages
  • Does not scale to a data structure

[Xu_2019] Xu Zhixing, Ray Sayak, Subramanyan Pramod and Malik Sharad. «Malware detection using machine learning based analysis of virtual memory access patterns». In Proceedings of the Conference on Design, Automation & Test in Europe.

SLIDE 15

STATE OF THE ART: DATA-LAYOUT DECISION PROBLEM

Existing approaches are compared along three axes:

  • Granularity: Scalar variable, Allocator block, Virtual page
  • Optimization time: compile time, run time
  • Target memory/application: Portable to new memories, Portable to new applications

Approaches compared: [Lian_05], [Shoushtari_18], [Serrano_19], [Doosan_08], [Kandemir_01], [Cooper_98], [Issenin_06]

Limitations:

  • Require human intervention
  • No direct code specialization to hardware

[Lian_05] Lian Li et al. 2005. Memory coloring: A compiler approach for scratchpad memory. In PACT.
[Shoushtari_18] Abdolmajid Namaki Shoushtari. 2018. Software Assists to On-chip Memory Hierarchy of Manycore Embedded Systems. Ph.D. Dissertation. UC Irvine.
[Serrano_19] Manuel Serrano et al. 2019. Property caches revisited. In CC.
[Doosan_08] Doosan Cho et al. 2008. Compiler driven data layout optimization for regular/irregular array access patterns. ACM.
[Kandemir_01] Mahmut Kandemir et al. 2001. Dynamic management of scratch-pad memory space. In DAC. IEEE.
[Cooper_98] Keith D Cooper and Timothy J Harvey. 1998. Compiler-controlled memory. In SIGOPS OSR. ACM.
[Issenin_06] Ilya Issenin et al. 2006. Multiprocessor system-on-chip data reuse analysis for exploring customized memory hierarchies. In DAC.

SLIDE 20

SCIENTIFIC APPROACH

Source Code (C/C++ based DSL)

Execution Trace:

Data Structure | Var. Name | @_base  | Access Type | Size | x | y
MATRIX         | res       | 0x2e170 | WRITE       | 4x4  | 3 | 3
MATRIX         | a         | 0x2e010 | READ        | 4x4  | 0 | 0
MATRIX         | b         | 0x2e0c0 | READ        | 4x4  | 0 | 0
MATRIX         | res       | 0x2e170 | UPDATE      | 4x4  | 0 | 0

Transformation: [Figure: sequence of accessed (X, Y) indexes, before and after the transformation]
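
A minimal sketch of one possible transformation function, assuming it replaces each accessed (X, Y) index by its difference with the previous access, so that a regular traversal becomes a short repeating sequence. The function name and the choice of a difference-based transformation are assumptions made for illustration; the slide does not spell out the function.

```python
def deltas(indexes):
    """Turn a sequence of (x, y) indexes into per-step (dx, dy) differences."""
    return [
        (x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(indexes, indexes[1:])
    ]


N = 4
# Hypothetical trace of a 4x4 matrix walked with x varying fastest.
walk = [(x, y) for y in range(N) for x in range(N)]
signature = deltas(walk)
print(signature[:N])  # prints "[(1, 0), (1, 0), (1, 0), (-3, 1)]"
```

The absolute indexes grow with the matrix size, while the difference sequence keeps the same shape (N-1 unit steps followed by one jump of 1-N in x), which makes it easier to compare against stored patterns.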

SLIDE 23

SCIENTIFIC APPROACH

  • Source Code (C/C++ based DSL)
  • Code Instrumentation → Execution Trace
  • Transformation function → Memory Signature for each variable: (a), (b), (res)
  • Correlation with the Data Base of known access-pattern signatures (HW Memory, Cache Policy, Transformation Function) → Optimal Implementation of each variable
  • Inject optimal implementation of each variable
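
The correlation step can be sketched as a nearest-match lookup: the signature measured for a variable is compared with every stored signature, and the implementation attached to the best match is selected. Everything below is a hypothetical illustration: the database content, the implementation names and the use of the Pearson coefficient as the similarity measure are assumptions, not the actual HARDSI database or metric.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 when either sequence is constant (zero variance).
    """
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    std_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    std_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (std_u * std_v) if std_u and std_v else 0.0


# Hypothetical database: implementation name -> known x-delta signature.
KNOWN_SIGNATURES = {
    "row-major layout": [1, 1, 1, -3, 1, 1, 1, -3],
    "column-major layout": [0, 0, 0, 1, 0, 0, 0, 1],
}


def best_implementation(signature, known=KNOWN_SIGNATURES):
    """Pick the implementation whose stored signature correlates best."""
    return max(known, key=lambda name: pearson(signature, known[name]))


observed = [1, 1, 1, -3, 1, 1, 1, -3]  # signature measured for one variable
print(best_implementation(observed))  # prints "row-major layout"
```

In a fuller version, the database would presumably be keyed by the hardware memory and cache policy as well, so that the same signature can map to different implementations on different hosts.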

SLIDE 25

RESULTS: Impact of the optimized implementation

SLIDE 26

RESULTS: Impact of the optimized implementation: LLC

SLIDE 27

RESULTS: Interest of HARDSI: finding the best known implementation

[1] Mohamed Benabderrahmane et al. 2010. The polyhedral model is more widely applicable than you think. In CC.
[10] Mahmut Kandemir et al. 2001. Dynamic management of scratch-pad memory space. In DAC. IEEE.
[14] Alain M Leger et al. 1991. JPEG still picture compression algorithm. Optical Engineering (1991).
[27] Qingxiong Yang. 2012. Recursive bilateral filtering. In ECCV.

SLIDE 28

CONCLUSIONS & PERSPECTIVES

Summary:

  • A novel method to solve the data-layout decision problem
  • Accurately maps a memory-access pattern to the corresponding optimized implementation

Perspectives:

  • Support new hardware memories (e.g. scratchpad, NVM, in-memory computing)
  • Reduce the complexity to support more complex code (ML, AI) in compiler-scale time

SLIDE 29

Centre de Saclay, Nano-Innov PC 172, 91191 Gif-sur-Yvette Cedex

SLIDE 30

Find the ideal implementation of each variable's data-structure

SLIDE 31

OUTLINE

  • State of the art: Pattern detection, usage and DLD
  • HARDSI: Data-Structure Implementation selection
  • Experimental Results
  • Perspectives: Data-cache modeling

SLIDE 35

RESULTS

Use case example: JPEG compression