
SLIDE 1

Pipeline Oriented Implementation of NORX for ARM Processors

Luan Cardoso dos Santos luan@lasca.ic.unicamp.br Julio López jlopez@ic.unicamp.br November 7, 2017

Institute of Computing - UNICAMP LASCA

SLIDE 2

Introduction

SLIDE 4

Authenticated encryption (with additional data)

An AEAD scheme is an algorithm that uses a secret key and a public nonce to process a plaintext and additional plain data, producing ciphertext and authentication data [Rog02].

Such a scheme is useful, for example, to encrypt the body of a message, keep a header in plaintext and authenticate the whole.

Figure 1: Basic block design of an AEAD.

2/37

SLIDE 5

Authenticated encryption (with additional data)

Formally:

An AEAD scheme is defined by $\Pi = (\mathcal{K}, E, D)$ and the associated sets $\mathrm{Nonce} = \{0,1\}^n$, $\mathrm{Header} \subset \{0,1\}^*$ and $\mathrm{Message} \subseteq \{0,1\}^*$.

The keyspace $\mathcal{K}$ is a non-empty set of strings.

The message $M \in \mathrm{Message}$; the nonce $N \in \mathrm{Nonce}$; the header $H \in \mathrm{Header}$.

The encryption algorithm: $E_K^{N,H}(M) \to C$.

The decryption algorithm: $D_K^{N,H}(C) \to \{M, \bot\}$.

It is required that $D_K^{N,H}(E_K^{N,H}(M)) = M$ for all $K \in \mathcal{K}$, $N$, $H$ and $M$.

And $|E_K^{N,H}(M)| = \ell(|M|)$ for some linear-time length function $\ell$.
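The interface above can be illustrated with a toy construction. The following is a minimal Python sketch, not NORX and not a vetted cipher: it assumes a SHAKE-256 keystream and a truncated HMAC-SHA-256 tag purely to show the roles of K, N, H, M, the ⊥ result (modelled as `None`) and the length function ℓ(|M|) = |M| + tag length. The names `encrypt`, `decrypt` and `TAG_LEN` are hypothetical.

```python
import hashlib
import hmac

TAG_LEN = 16  # tag length in bytes (illustrative choice, not from the slides)


def _keystream(key, nonce, n):
    # SHAKE-256 used as a toy keystream generator keyed by key || nonce.
    return hashlib.shake_256(b"ks" + key + nonce).digest(n)


def _tag(key, nonce, header, body):
    # Length-prefix the header so (header, body) cannot be confused.
    msg = nonce + len(header).to_bytes(8, "little") + header + body
    return hmac.new(key, msg, hashlib.sha256).digest()[:TAG_LEN]


def encrypt(key, nonce, header, message):
    """E_K^{N,H}(M): returns ciphertext body followed by the tag."""
    stream = _keystream(key, nonce, len(message))
    body = bytes(m ^ k for m, k in zip(message, stream))
    return body + _tag(key, nonce, header, body)


def decrypt(key, nonce, header, ciphertext):
    """D_K^{N,H}(C): returns the message, or None (playing the role of ⊥)."""
    if len(ciphertext) < TAG_LEN:
        return None
    body, tag = ciphertext[:-TAG_LEN], ciphertext[-TAG_LEN:]
    if not hmac.compare_digest(tag, _tag(key, nonce, header, body)):
        return None
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ k for c, k in zip(body, stream))
```

Changing the header, the nonce or any ciphertext byte makes `decrypt` return `None`, while the header itself is never encrypted.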

3/37

SLIDE 6

Cryptographic competitions: CAESAR

CAESAR (2013–) stands for “Competition for Authenticated Encryption: Security, Applicability, and Robustness” [CAE13].

CAESAR aims to select a portfolio of AEAD ciphers that are suited for widespread adoption and offer advantages over NIST’s AES-GCM.

Following in the footsteps of other cryptographic competitions, such as SHA-3 (2007-2012), AES (1997-2000) and eSTREAM (2004-2008), CAESAR also aims to promote research on AEAD algorithms.

4/37

SLIDE 7

Cryptographic sponges

A cryptographic sponge function is an algorithm with a finite internal state that receives input strings of any length and produces an output of the desired length [BDPA11].

Sponges can be used to create hash functions, MACs, stream ciphers, RNGs and AEAD schemes.

Figure 2: The basic design of a sponge function [BDPA11].
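As a rough illustration of the absorb/squeeze structure, here is a toy sponge in Python. It is a sketch under loud assumptions: the rate and capacity are arbitrary, the padding is a simplified "append 0x01 then zeros" rule, and SHA-256 stands in for the fixed-width transformation (it is a function, not a permutation, and it is not what NORX uses). The names `sponge`, `transform`, `RATE` and `CAPACITY` are hypothetical.

```python
import hashlib

RATE = 8        # bytes absorbed or squeezed per call (illustrative)
CAPACITY = 24   # bytes never directly exposed (illustrative)
WIDTH = RATE + CAPACITY  # 32 bytes, matching the SHA-256 output size


def transform(state):
    # Stand-in for the fixed-width permutation of a real sponge.
    return hashlib.sha256(state).digest()


def sponge(message, out_len):
    """Absorb `message`, then squeeze `out_len` bytes of output."""
    padded = message + b"\x01"
    padded += b"\x00" * (-len(padded) % RATE)

    state = bytes(WIDTH)
    # Absorbing phase: XOR each block into the outer part, then transform.
    for i in range(0, len(padded), RATE):
        block = padded[i:i + RATE]
        outer = bytes(x ^ y for x, y in zip(state[:RATE], block))
        state = transform(outer + state[RATE:])

    # Squeezing phase: output the outer part, transform, repeat.
    out = b""
    while len(out) < out_len:
        out += state[:RATE]
        state = transform(state)
    return out[:out_len]
```

Because the output is produced block by block, a longer request extends a shorter one: the first 20 bytes of `sponge(m, 40)` equal `sponge(m, 20)`.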

5/37

SLIDE 8

Target architecture

SLIDE 9

ARM processors

The Advanced RISC Machine (ARM) is a mainly 32-bit architecture owned by the British company ARM Holdings.

With more than 100 billion chips deployed up to 2017, it is one of the most widespread architectures nowadays.¹

ARM is a load/store architecture, with mostly single-clock-cycle execution.

In this work, we focused on the Cortex-A family: Cortex-A7, Cortex-A15 and Cortex-A53.

¹ https://community.arm.com/processors/b/blog/posts/inside-the-numbers-100-billion-arm-based-chips-1345571105

6/37

SLIDE 10

ARM processors: Target cores i

Cortex-A7: The most efficient ARMv7-A core, with over a billion shipped units. It is capable of 40-bit physical addressing and features an eight-stage in-order pipeline. It can be paired with other high-performance cores in big.LITTLE configurations.

Cortex-A15: A high-performance ARMv7-A core, well suited to consumer items such as smartphones and to embedded applications. As with other processors of the same line, it is capable of 40-bit physical addressing. It also features a fifteen-stage out-of-order superscalar pipeline for integer calculations.

7/37

SLIDE 11

ARM processors: Target cores ii

Cortex-A53: An ARMv8-A core capable of seamlessly running both 32-bit and 64-bit code, designed as an efficient 64-bit core with a low area and power footprint. Like the Cortex-A7, it can be deployed together with high-end CPUs in chips with heterogeneous cores. The Cortex-A53 uses an efficient eight-stage, 2-way superscalar, in-order pipeline.

Our tests were also carried out on Cortex-M4, M3 and M0, for completeness.

8/37

SLIDE 12

NORX family of AEAD algorithms

SLIDE 13

NORX AEAD

NORX is a family of AEAD algorithms, currently in the third round of CAESAR.

Based on a sponge design, it is a simple yet fast algorithm, optimized for both 32-bit and 64-bit architectures.

The design of NORX also allows for arbitrary parallelism in the payload processing.

Based on ARX² primitives, NORX is optimized for both software and hardware implementations, with a SIMD-friendly core permutation and no secret-dependent memory access.

² Addition-Rotation-XOR

9/37

SLIDE 14

NORX AEAD

The naming convention for NORX is NORXwlpt, where:
w is the bit size of the words in the internal state.
l is the number of rounds.
p is the parallelism degree.
t is the bit length of the authentication tag. When t = 4w, it is omitted.

The key length of NORX is k = 4w; therefore, the 32-bit algorithm has a security level of 128 bits, while the 64-bit algorithm has a security level of 256 bits.
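The arithmetic behind the naming convention can be written down directly. The sketch below is only an illustration: the helper `norx_instance` and its dictionary layout are hypothetical, and it spells names with hyphens (NORX32-6-1) for readability, whereas these slides write the same instance as NORX3261.

```python
def norx_instance(w, l, p, t=None):
    """Derive the parameters implied by the NORX naming convention."""
    if w not in (32, 64):
        raise ValueError("word size w is expected to be 32 or 64 bits")
    key_bits = 4 * w                      # k = 4w
    tag_bits = key_bits if t is None else t
    name = f"NORX{w}-{l}-{p}"
    if tag_bits != 4 * w:                 # t is omitted when t = 4w
        name += f"-{tag_bits}"
    return {
        "name": name,
        "rounds": l,
        "parallelism": p,
        "key_bits": key_bits,
        "tag_bits": tag_bits,
        "state_bits": 16 * w,             # 16-word internal state
    }


print(norx_instance(32, 6, 1))
print(norx_instance(64, 6, 1))
```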

10/37

SLIDE 15

NORX’s mode of operation i

The state is transformed in each step of the cipher using a non-linear permutation F^l, that is, l rounds of F.

Figure 3: The layout of NORX [AJN14].

11/37

SLIDE 16

NORX’s mode of operation ii

Figure 4: The layout of NORX, with parallel payload processing [AJN14].

12/37

SLIDE 17

NORX’s core permutation

The core of NORX is a 16-word internal state S, which can be viewed as a 4 × 4 matrix:

$$S = \begin{pmatrix} s_0 & s_1 & s_2 & s_3 \\ s_4 & s_5 & s_6 & s_7 \\ s_8 & s_9 & s_{10} & s_{11} \\ s_{12} & s_{13} & s_{14} & s_{15} \end{pmatrix}$$

13/37

SLIDE 18

Pipeline optimization

SLIDE 19

Original permutation

The permutation can be visually represented as:

Figure 5: Column (left) and diagonal (right) steps. Source: NORX v3.0 specification [AJN14].

14/37

SLIDE 20

Original permutation

The NORX permutation F is built from a function G(), applied first to the columns and then to the diagonals of S:

Algorithm 1: NORX F round function

function F
    input: S, G()                              ▷ NORX state s0 ⋯ s15 and the G() function
    s0, s4, s8, s12 ← G(s0, s4, s8, s12)       ▷ Processing the columns
    s1, s5, s9, s13 ← G(s1, s5, s9, s13)
    s2, s6, s10, s14 ← G(s2, s6, s10, s14)
    s3, s7, s11, s15 ← G(s3, s7, s11, s15)
    s0, s5, s10, s15 ← G(s0, s5, s10, s15)     ▷ Processing the diagonals
    s1, s6, s11, s12 ← G(s1, s6, s11, s12)
    s2, s7, s8, s13 ← G(s2, s7, s8, s13)
    s3, s4, s9, s14 ← G(s3, s4, s9, s14)
    output: S
end function
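The index pattern of the round function can be sketched in Python as follows. The index tables are copied from Algorithm 1; the function `f_round` takes G as a parameter, and `rotate_words` is a hypothetical stand-in for G, used only to make the data movement visible.

```python
# Word indices handled by each of the eight G() calls in one round.
COLUMNS = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]
DIAGONALS = [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]


def f_round(state, g):
    """Apply g to the four columns, then the four diagonals, of a 16-word state."""
    s = list(state)
    for idx in COLUMNS + DIAGONALS:
        new_words = g(*(s[i] for i in idx))
        for i, word in zip(idx, new_words):
            s[i] = word
    return s


def rotate_words(a, b, c, d):
    # Hypothetical stand-in for G(): it only rotates its four inputs.
    return b, c, d, a


print(f_round(range(16), rotate_words))
```

The four column calls touch disjoint words, and so do the four diagonal calls, which is what makes reordering them possible.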

15/37

SLIDE 21

Original permutation

With G(a, b, c, d) being defined as follows, where ≪ is a left shift and ≫ is a right rotation of a w-bit word:

Algorithm 2: NORX G permutation function

function G
    input: a, b, c, d                  ▷ Four words of the state
    a ← (a ⊕ b) ⊕ ((a ∧ b) ≪ 1)
    d ← (a ⊕ d) ≫ r0
    c ← (c ⊕ d) ⊕ ((c ∧ d) ≪ 1)
    b ← (c ⊕ b) ≫ r1
    a ← (a ⊕ b) ⊕ ((a ∧ b) ≪ 1)
    d ← (a ⊕ d) ≫ r2
    c ← (c ⊕ d) ⊕ ((c ∧ d) ≪ 1)
    b ← (c ⊕ b) ≫ r3
    output: a, b, c, d
end function
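A direct Python transcription of G may help when checking an optimized implementation against a simple model. This is a sketch, not the authors' code: the rotation constants r0..r3 are not given in these slides and are assumed here from the NORX specification, and the helper names `rotr` and `h` are hypothetical.

```python
# Rotation constants (r0, r1, r2, r3) per word size.
# Assumption: values taken from the NORX specification, not from these slides.
ROTATIONS = {32: (8, 11, 16, 31), 64: (8, 19, 40, 63)}


def rotr(x, r, w):
    """Rotate the w-bit word x right by r bits (0 < r < w)."""
    mask = (1 << w) - 1
    return ((x >> r) | (x << (w - r))) & mask


def h(x, y, w):
    """The non-linear step (x ^ y) ^ ((x & y) << 1), truncated to w bits."""
    mask = (1 << w) - 1
    return ((x ^ y) ^ ((x & y) << 1)) & mask


def g(a, b, c, d, w=32):
    """One application of the NORX G function on four w-bit words."""
    r0, r1, r2, r3 = ROTATIONS[w]
    a = h(a, b, w)
    d = rotr(a ^ d, r0, w)
    c = h(c, d, w)
    b = rotr(b ^ c, r1, w)
    a = h(a, b, w)
    d = rotr(a ^ d, r2, w)
    c = h(c, d, w)
    b = rotr(b ^ c, r3, w)
    return a, b, c, d
```

Note the dependency chain: every statement consumes the result of the previous one, so a single G leaves little room for instruction-level parallelism.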

How can we improve this permutation?

16/37

SLIDE 22

Code profiling

A synthetic test that encrypts random data was profiled to identify hotspots. roundF is the best target for optimization.

Figure 6: Profiling results

17/37

SLIDE 23

Optimizing the F() function

The G() function can be split and reorganized to make better use of the processor’s pipeline:

Figure 7: Column and diagonal steps, with two-way pipeline optimization. Notice that each call to function G2() operates over 8 words.
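The two-way reorganization can be modelled as interleaving the statements of two independent G computations. The Python sketch below only checks that this reordering leaves the results unchanged; Python cannot show the pipeline benefit, and the rotation constants are again assumed from the NORX specification. The name `g2` follows the figure, while `rotr`, `h` and `g` are hypothetical helpers.

```python
ROTATIONS = {32: (8, 11, 16, 31), 64: (8, 19, 40, 63)}  # assumed from the NORX spec


def rotr(x, r, w):
    return ((x >> r) | (x << (w - r))) & ((1 << w) - 1)


def h(x, y, w):
    return ((x ^ y) ^ ((x & y) << 1)) & ((1 << w) - 1)


def g(a, b, c, d, w=32):
    """Reference G: one lane, fully sequential."""
    r = ROTATIONS[w]
    for i in (0, 2):
        a = h(a, b, w)
        d = rotr(a ^ d, r[i], w)
        c = h(c, d, w)
        b = rotr(b ^ c, r[i + 1], w)
    return a, b, c, d


def g2(lane0, lane1, w=32):
    """Two G computations with their steps interleaved (8 words in total)."""
    r0, r1, r2, r3 = ROTATIONS[w]
    a0, b0, c0, d0 = lane0
    a1, b1, c1, d1 = lane1
    # Adjacent statements belong to different lanes, so they are independent.
    a0 = h(a0, b0, w)
    a1 = h(a1, b1, w)
    d0 = rotr(a0 ^ d0, r0, w)
    d1 = rotr(a1 ^ d1, r0, w)
    c0 = h(c0, d0, w)
    c1 = h(c1, d1, w)
    b0 = rotr(b0 ^ c0, r1, w)
    b1 = rotr(b1 ^ c1, r1, w)
    a0 = h(a0, b0, w)
    a1 = h(a1, b1, w)
    d0 = rotr(a0 ^ d0, r2, w)
    d1 = rotr(a1 ^ d1, r2, w)
    c0 = h(c0, d0, w)
    c1 = h(c1, d1, w)
    b0 = rotr(b0 ^ c0, r3, w)
    b1 = rotr(b1 ^ c1, r3, w)
    return (a0, b0, c0, d0), (a1, b1, c1, d1)
```

In compiled code, each statement no longer has to wait for the one immediately before it, which is the property an in-order dual-issue pipeline can exploit.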

18/37

SLIDE 24

Optimizing the F() function

Or even further, with a single function G4() operating on the whole state at once:

Figure 8: Column and diagonal steps, with four-way pipeline optimization, operating over the whole state at once.
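Extending the idea to four lanes gives a round that performs each G step on all four columns (or all four diagonals) before moving to the next step. The following Python model is a sketch under the same assumptions as before (rotation constants from the NORX specification, hypothetical helper names); it checks that the four-way ordering matches a plain sequential round.

```python
ROTATIONS = {32: (8, 11, 16, 31), 64: (8, 19, 40, 63)}  # assumed from the NORX spec
COLUMNS = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]
DIAGONALS = [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]


def rotr(x, r, w):
    return ((x >> r) | (x << (w - r))) & ((1 << w) - 1)


def h(x, y, w):
    return ((x ^ y) ^ ((x & y) << 1)) & ((1 << w) - 1)


def g_interleaved(lanes, w=32):
    """Run G on several independent lanes, one step across all lanes at a time."""
    r = ROTATIONS[w]
    n = len(lanes)
    a = [lane[0] for lane in lanes]
    b = [lane[1] for lane in lanes]
    c = [lane[2] for lane in lanes]
    d = [lane[3] for lane in lanes]
    for i in (0, 2):
        a = [h(a[k], b[k], w) for k in range(n)]
        d = [rotr(a[k] ^ d[k], r[i], w) for k in range(n)]
        c = [h(c[k], d[k], w) for k in range(n)]
        b = [rotr(b[k] ^ c[k], r[i + 1], w) for k in range(n)]
    return list(zip(a, b, c, d))


def f_round_4way(state, w=32):
    """One round: a 4-lane G over the columns, then over the diagonals."""
    s = list(state)
    for groups in (COLUMNS, DIAGONALS):
        lanes = [tuple(s[i] for i in idx) for idx in groups]
        for idx, words in zip(groups, g_interleaved(lanes, w)):
            for i, word in zip(idx, words):
                s[i] = word
    return s


def f_round_reference(state, w=32):
    """One round with eight separate, sequential G calls."""
    s = list(state)
    for idx in COLUMNS + DIAGONALS:
        (words,) = g_interleaved([tuple(s[i] for i in idx)], w)
        for i, word in zip(idx, words):
            s[i] = word
    return s
```

The equivalence holds because the four columns are pairwise disjoint, as are the four diagonals.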

19/37

SLIDE 25

Additional optimizations

A few extra steps were taken to improve code performance:

Extensive use of preprocessor macros and code inlining.

Avoiding extra or temporary variables, encrypting and decrypting in place.

Initialization of the sponge via loads of constant values instead of evaluating F^2(0 ∥ 1 ∥ 2 ∥ ⋯ ∥ 15).

Where possible, combining shift and rotate operations with arithmetic ones, so as to allow the use of ARM’s barrel shifter. For example, a = a + (b << 2) can compile into ADD r1, r1, r2, LSL #2. The parentheses matter: in C, a + b << 2 parses as (a + b) << 2.
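The effect of a shifted second operand can be modelled in a few lines of Python. This is only an illustrative model of 32-bit register arithmetic: the helper names are hypothetical, and the three-instruction sequence for the non-linear step of G is an assumption about how a compiler might use the barrel shifter, not output taken from the authors' build.

```python
MASK32 = 0xFFFFFFFF


def lsl(x, n):
    """Logical shift left on a 32-bit register value."""
    return (x << n) & MASK32


def add_shifted(rn, rm, n):
    """Models ADD rd, rn, rm, LSL #n: the shift is folded into the add."""
    return (rn + lsl(rm, n)) & MASK32


def eor_shifted(rn, rm, n):
    """Models EOR rd, rn, rm, LSL #n."""
    return rn ^ lsl(rm, n)


def h_three_ops(a, b):
    """(a ^ b) ^ ((a & b) << 1) as three register operations."""
    t = a & b                     # AND t, a, b
    a = a ^ b                     # EOR a, a, b
    return eor_shifted(a, t, 1)   # EOR a, a, t, LSL #1


# Operator precedence is the same in C and Python for these operators.
print(1 + 3 << 2)      # (1 + 3) << 2
print(1 + (3 << 2))    # the form that matches ADD ..., LSL #2
```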

20/37

SLIDE 26

Results

SLIDE 27

Benchmark i

Benchmarks were carried out on an Odroid XU4 device running Arch Linux for Cortex-A7 and Cortex-A15.

An Odroid-C2 device was used for tests with the 64-bit Cortex-A53, also running Arch Linux.

The code was compiled with gcc 6.3.1.

21/37

SLIDE 28

Benchmark ii

Each test consists of the encryption of random data with lengths between 128 bytes and 1 megabyte, with a 128-bit key for NORX3261 and NORX3264, and a 256-bit key for NORX6461.

Our tests were also carried out on Cortex-M4, M3 and M0 for consistency.

All measurements were taken using the processors’ cycle counter.

22/37

SLIDE 29

Results i

Table 1: Cycles per byte for NORX encryption on the 32-bit processors, with a plaintext length of 256KiB. Reference code from CAESAR [AJN15]. NORX3261 and NORX3264 use a 128-bit key, and NORX6461 uses a 256-bit key.

Variant    Core         Ref. code   4x pipe   2x pipe   Ref. NEON   Speedup
NORX3261   Cortex-A7    29.45       29.70     24.72     26.99       16%
NORX3261   Cortex-A15   17.77       14.23     15.16     18.25       20%
NORX3264   Cortex-A7    28.46       33.74     26.50     33.27       7%
NORX3264   Cortex-A15   16.88       15.26     15.37     18.21       10%
NORX6461   Cortex-A7    48.52       50.09     46.65     17.81       4%
NORX6461   Cortex-A15   33.83       26.76     28.33     10.90       21%

23/37

SLIDE 30

Results ii

Table 2: Cycles per byte for NORX encryption on the 64-bit platform, with a plaintext length of 256KiB. Reference code from CAESAR [AJN15]. NORX3261 and NORX3264 use a 128-bit key, and NORX6461 uses a 256-bit key.

Variant    Ref. code   4x pipe   2x pipe   Ref. NEON   Speedup
NORX3261   19.55       10.94     12.27     10.81       44%
NORX3264   19.42       12.08     13.06     9.56        38%
NORX6461   10.29       5.84      6.58      24.40       43%
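The Speedup column is consistent with comparing the reference code against the faster of the two pipelined versions. The slides do not state the formula, so the Python sketch below rests on that assumption; the cycle counts are copied from Tables 1 and 2, and the helper name `speedup_percent` is hypothetical.

```python
# (reference, 4x pipe, 2x pipe) cycles per byte, copied from Tables 1 and 2.
CYCLES_PER_BYTE = {
    ("NORX3261", "Cortex-A7"): (29.45, 29.70, 24.72),
    ("NORX3261", "Cortex-A15"): (17.77, 14.23, 15.16),
    ("NORX3264", "Cortex-A7"): (28.46, 33.74, 26.50),
    ("NORX3264", "Cortex-A15"): (16.88, 15.26, 15.37),
    ("NORX6461", "Cortex-A7"): (48.52, 50.09, 46.65),
    ("NORX6461", "Cortex-A15"): (33.83, 26.76, 28.33),
    ("NORX3261", "Cortex-A53"): (19.55, 10.94, 12.27),
    ("NORX3264", "Cortex-A53"): (19.42, 12.08, 13.06),
    ("NORX6461", "Cortex-A53"): (10.29, 5.84, 6.58),
}


def speedup_percent(reference, *candidates):
    """Relative reduction in cycles per byte of the best candidate (assumed formula)."""
    best = min(candidates)
    return 100.0 * (reference - best) / reference


for (variant, core), (ref, pipe4, pipe2) in CYCLES_PER_BYTE.items():
    print(f"{variant:9} {core:11} {speedup_percent(ref, pipe4, pipe2):5.1f}%")
```

Rounded to whole percentages, these values reproduce the Speedup column of both tables.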

24/37

SLIDE 31

Results iii

Table 3: Performance of NORX3261 (cycles per byte) on the 32-bit Cortex-M architecture. Reference code from CAESAR [AJN15].

Cortex model   Size     Ref. code   No pipeline optimizations   4x pipe   2x pipe
M0             8KiB     100.12      99.52                       111.84    99.96
M3             32KiB    50.49       49.96                       67.21     66.26
M4             16KiB    50.49       49.96                       47.28     66.26

25/37

SLIDE 32

Result charts: NORX 3261

Figure 9: Chart showing the results of NORX3261 on Cortex A15

26/37

SLIDE 33

Result charts: NORX 3264

Figure 10: Chart showing the results of NORX3264 on Cortex A15

27/37

SLIDE 34

Result charts: NORX 6461

Figure 11: Chart showing the results of NORX6461 on Cortex A53

28/37

SLIDE 35

Conclusion: 32-bit NORX i

For the 32-bit variant of NORX, the 4× pipeline implementation is 20% faster than the reference code on a 32-bit core, and 44% faster on a 64-bit core.

Compared with a NEON SIMD implementation, the 2× pipeline is 12% faster on a Cortex-A7, and the 4× pipeline is 22% faster on the Cortex-A15.

29/37

SLIDE 36

Conclusion: 32-bit NORX ii

While NORX has a SIMD-friendly construction, there are two extra transformations needed in each application of F. Together with the extra cost to transfer data from NEON registers back to ARM registers, a pipelined implementation yields better performance.

Multisponge implementations of NORX (p = 4), running on a single core, show similar speedups.

30/37

SLIDE 37

Conclusion 64-bit NORX

For the 64-bit variant of NORX, a 2× pipeline is better suited for the Cortex-A7, and the 4× pipeline for the Cortex-A15, mainly due to pipeline length.

The 64-bit NORX performs better on 32-bit cores when using SIMD instructions. The 64-bit rotations are very straightforward in NEON registers, compared to 32-bit registers.

On the 64-bit processor, both pipelined implementations are faster than the reference code and the NEON code: 4× is 39% faster, and 2× is 31% faster.

31/37

SLIDE 38

Conclusion

We presented an efficient software implementation of the NORX family of AEAD algorithms. In particular, the technique led to a positive performance impact on the target processors.

32/37

SLIDE 39

Future work

SLIDE 40

Future work: Blake2 hash function

The BLAKE2 hashing algorithm has a structure very similar to the NORX sponge state (a 4 × 4 word state), with a similar permutation. It is possible that applying the same techniques can yield good results.

G_blake:
    a ← a + b + m_{σ_r(2i)}
    d ← (d ⊕ a) ≫ 32
    c ← c + d
    b ← (b ⊕ c) ≫ 24
    a ← a + b + m_{σ_r(2i+1)}
    d ← (d ⊕ a) ≫ 16
    c ← c + d
    b ← (b ⊕ c) ≫ 63
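For comparison with the NORX G function, the BLAKE2 mixing function can be sketched in Python as below. The rotation amounts 32, 24, 16 and 63 correspond to the 64-bit variant (BLAKE2b); the parameters `x` and `y` stand for the two message words m_{σ_r(2i)} and m_{σ_r(2i+1)}, and the helper names are hypothetical. Additions are modulo 2^64 and ≫ is a right rotation.

```python
MASK64 = (1 << 64) - 1


def rotr64(value, r):
    """Rotate a 64-bit word right by r bits (0 < r < 64)."""
    return ((value >> r) | (value << (64 - r))) & MASK64


def g_blake(a, b, c, d, x, y):
    """BLAKE2b-style mixing of four state words with two message words."""
    a = (a + b + x) & MASK64
    d = rotr64(d ^ a, 32)
    c = (c + d) & MASK64
    b = rotr64(b ^ c, 24)
    a = (a + b + y) & MASK64
    d = rotr64(d ^ a, 16)
    c = (c + d) & MASK64
    b = rotr64(b ^ c, 63)
    return a, b, c, d
```

As in NORX, each statement depends on the previous one, so interleaving the four column (or diagonal) calls is the natural place to look for the same kind of pipeline gain.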

33/37

SLIDE 41

Acknowledgement

This work was supported by LG Electronics via the “Crypto for IoT” project. The second author was partially supported by a research productivity grant from CNPq. We thank Prof. Diego Aranha for helping with the benchmark platforms.

34/37

SLIDE 42

References i

Jean-Philippe Aumasson, Philipp Jovanovic, and Samuel Neves. NORX: parallel and scalable AEAD. In ESORICS (2), volume 8713 of Lecture Notes in Computer Science, pages 19–36. Springer, 2014.

Jean-Philippe Aumasson, Philipp Jovanovic, and Samuel Neves. NORX reference implementations (software). https://github.com/norx/norx, 2015.

35/37

SLIDE 43

References ii

Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles Van Assche. Duplexing the sponge: single-pass authenticated encryption and other applications. IACR Cryptology ePrint Archive, 2011:499, 2011.

Committee CAESAR. Competition for authenticated encryption: Security, applicability, and robustness. http://competitions.cr.yp.to, April 2013.

36/37

SLIDE 44

References iii

ARM Holdings. Processors cortex-a. http://www.arm.com/products/processors/cortex-a, March 2017.

Phillip Rogaway. Authenticated-encryption with associated-data. In ACM Conference on Computer and Communications Security, pages 98–107. ACM, 2002.

37/37
