SLIDE 1
Pipeline Oriented Implementation of NORX for ARM Processors
Luan Cardoso dos Santos luan@lasca.ic.unicamp.br Julio López jlopez@ic.unicamp.br November 7, 2017
Institute of Computing - UNICAMP LASCA
SLIDE 2 Table of contents
- 1. Introduction
- 2. Target architecture
- 3. NORX family of AEAD algorithms
- 4. Pipeline optimization
- 5. Results
- 6. Future work
SLIDE 3
Introduction
SLIDE 4 Authenticated encryption (with additional data)
- An AEAD scheme is an algorithm that uses a secret key and a public nonce to process a plaintext and additional plain data, and outputs ciphertext and authentication data [Rog02].
- Such a scheme is useful, for example, to encrypt the body of a message, keep a header in plaintext and authenticate the whole.
Figure 1: Basic block design of an AEAD.
SLIDE 5 Authenticated encryption (with additional data)
Formally:
- An AEAD scheme is defined by Π = (K, E, D) and the associated sets Nonce = {0, 1}^n, Header ⊂ {0, 1}^* and Message ⊆ {0, 1}^*.
- The keyspace K is a non-empty set of strings.
- The message M ∈ Message; the nonce N ∈ Nonce; the header H ∈ Header.
- The encryption algorithm E_K^{N,H}(M) → C.
- The decryption algorithm D_K^{N,H}(C) → {M, ⊥}.
- D_K^{N,H}(E_K^{N,H}(M)) = M for all K ∈ K, N, H and M.
- |E_K^{N,H}(M)| = ℓ(|M|) for some linear-time length function ℓ.
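The interface Π = (K, E, D) can be sketched in a few lines. The construction below (a SHAKE-256 keystream plus an HMAC-SHA-256 tag) is a toy chosen only to show the roles of K, N, H and M. It is an assumption made for illustration, not NORX and not a vetted scheme, and the names `encrypt`, `decrypt` and `TAG_LEN` are hypothetical.

```python
import hashlib
import hmac

TAG_LEN = 32  # bytes of authentication data appended to the ciphertext


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Toy keystream: SHAKE-256 over a domain label, key and nonce (illustration only).
    return hashlib.shake_256(b"enc" + key + nonce).digest(length)


def _tag(key: bytes, nonce: bytes, header: bytes, body: bytes) -> bytes:
    # Authenticate nonce, header and encrypted body; length prefixes keep the fields unambiguous.
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for field in (nonce, header, body):
        mac.update(len(field).to_bytes(8, "little") + field)
    return mac.digest()


def encrypt(key: bytes, nonce: bytes, header: bytes, message: bytes) -> bytes:
    """E_K^{N,H}(M): returns C = encrypted body || tag."""
    stream = _keystream(key, nonce, len(message))
    body = bytes(m ^ s for m, s in zip(message, stream))
    return body + _tag(key, nonce, header, body)


def decrypt(key: bytes, nonce: bytes, header: bytes, ciphertext: bytes):
    """D_K^{N,H}(C): returns M, or None (standing in for ⊥) if authentication fails."""
    if len(ciphertext) < TAG_LEN:
        return None
    body, tag = ciphertext[:-TAG_LEN], ciphertext[-TAG_LEN:]
    if not hmac.compare_digest(tag, _tag(key, nonce, header, body)):
        return None
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))
```

In this sketch the length function is ℓ(|M|) = |M| + TAG_LEN, and changing the header H after encryption makes decryption return ⊥ even though H itself was never encrypted.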
SLIDE 6 Cryptographic competitions: CAESAR
- CAESAR (2013–) stands for “Competition for Authenticated Encryption: Security, Applicability, and Robustness” [CAE13].
- CAESAR aims to select a portfolio of AEAD ciphers that are suited for widespread adoption and offer advantages over NIST’s AES-GCM.
- Following in the footsteps of other cryptographic competitions, such as SHA-3 (2007-2012), AES (1997-2000) and eSTREAM (2004-2008), CAESAR also aims to promote research on AEAD algorithms.
SLIDE 7 Cryptographic sponges
- A cryptographic sponge function is an algorithm with a
finite internal state, that receives as input strings of any length and produces an output of desired length [BDPA11].
- Sponges can be used to create hash functions, MACs, stream ciphers, RNGs and AEAD schemes.
Figure 2: The basic design of a sponge function [BDPA11].
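The absorb/squeeze structure of a sponge can be sketched as follows. This is a minimal sketch under stated assumptions: the names `RATE`, `CAPACITY`, `f` and `sponge` are hypothetical, the padding is simplified, and `f` is a truncated SHA-256 used as a stand-in, which is not a true permutation.

```python
import hashlib

RATE = 8       # bytes absorbed or squeezed per call to f
CAPACITY = 24  # bytes of state that are never directly exposed
WIDTH = RATE + CAPACITY


def f(state: bytes) -> bytes:
    # Stand-in for the fixed permutation. SHA-256 is NOT a permutation;
    # it only keeps this sketch short.
    return hashlib.sha256(state).digest()[:WIDTH]


def sponge(message: bytes, out_len: int) -> bytes:
    # Simplified padding: one 0x80 byte, then zeros up to a multiple of the rate.
    padded = message + b"\x80"
    padded += b"\x00" * (-len(padded) % RATE)

    state = bytes(WIDTH)
    # Absorbing phase: XOR each block into the rate part, then apply f.
    for i in range(0, len(padded), RATE):
        block = padded[i:i + RATE]
        rate_part = bytes(s ^ b for s, b in zip(state[:RATE], block))
        state = f(rate_part + state[RATE:])

    # Squeezing phase: output the rate part, applying f between outputs.
    out = b""
    while len(out) < out_len:
        out += state[:RATE]
        state = f(state)
    return out[:out_len]
```

Because output is produced one rate-sized block at a time, a longer output always extends a shorter one for the same input.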
SLIDE 8
Target architecture
SLIDE 9 ARM processors
- ARM (Advanced RISC Machine) is a mainly 32-bit architecture owned by the British company ARM Holdings.
- With more than 100 billion chips shipped up to 2017, it is one of the most widespread architectures today.¹
- ARM is a load/store architecture, and most instructions execute in a single clock cycle.
- In this work, we focused on the Cortex-A family: Cortex-A7,
Cortex-A15 and Cortex-A53.
¹ https://community.arm.com/processors/b/blog/posts/inside-the-numbers-100-billion-arm-based-chips-1345571105
SLIDE 10 ARM processors: Target cores i
- Cortex-A7: The most efficient ARMv7-A core, with over a billion units shipped. It supports 40-bit physical addressing and features an eight-stage in-order pipeline. It can be combined with high-performance cores in big.LITTLE configurations.
- Cortex-A15: A high-performance ARMv7-A core, well suited to consumer devices such as smartphones and to embedded applications. Like other processors of the same line, it supports 40-bit physical addressing. It also features a fifteen-stage out-of-order superscalar pipeline for integer operations.
SLIDE 11 ARM processors: Target cores ii
- Cortex-A53: An ARMv8-A core capable of seamlessly running both 32-bit and 64-bit code, designed as an efficient 64-bit core with a low area and power footprint. Like the Cortex-A7, it can be deployed together with high-end CPUs in chips with heterogeneous cores. The Cortex-A53 uses an efficient eight-stage, 2-way superscalar, in-order pipeline.
Our tests were also carried out on the Cortex-M4, M3 and M0, for completeness.
SLIDE 12
NORX family of AEAD algorithms
SLIDE 13 NORX AEAD
- NORX is a family of AEAD algorithms, currently in the third round of CAESAR.
- Based on a sponge design, it is a simple yet fast algorithm, optimized for both 32-bit and 64-bit architectures.
- The design of NORX also allows for arbitrary parallelism in the payload processing.
- Based on ARX² primitives, NORX is optimized for both software and hardware implementations, with a SIMD-friendly core permutation and no secret-dependent memory accesses.
² Addition-Rotation-XOR
SLIDE 14 NORX AEAD
- The naming convention for NORX is NORXwlpt, where:
  - w is the size in bits of the words in the internal state.
  - l is the number of rounds.
  - p is the parallelism degree.
  - t is the length in bits of the authentication tag. When t = 4w, it is omitted.
- For example, NORX3261 has w = 32, l = 6 and p = 1.
- The key length of NORX is k = 4w. Therefore, the 32-bit algorithm has a security level of 128 bits, while the 64-bit algorithm has a security level of 256 bits.
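The relations between the parameters can be written down directly. The class name `NorxParams` and its properties are hypothetical and only restate the rules from the slide (k = 4w, default t = 4w).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NorxParams:
    w: int                   # word size in bits (32 or 64)
    l: int                   # number of rounds
    p: int                   # parallelism degree
    t: Optional[int] = None  # tag length in bits; None means the default 4w

    @property
    def tag_bits(self) -> int:
        return 4 * self.w if self.t is None else self.t

    @property
    def key_bits(self) -> int:
        return 4 * self.w  # k = 4w

    @property
    def name(self) -> str:
        base = f"NORX{self.w}{self.l}{self.p}"
        # The tag length is left out of the name when t = 4w.
        return base if self.tag_bits == 4 * self.w else f"{base}{self.t}"
```

For instance, `NorxParams(32, 6, 1)` corresponds to NORX3261 with a 128-bit key and a 128-bit tag.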
SLIDE 15
NORX’s mode of operation i
The state is transformed in each step of the cipher using a non-linear permutation F^l (l applications of the round function F).
Figure 3: The layout of NORX [AJN14].
SLIDE 16
NORX’s mode of operation ii
Figure 4: The layout of NORX, with parallel payload processing [AJN14].
SLIDE 17 NORX’s core permutation
- The core of NORX is a 16-word internal state S, which can be viewed as a 4 × 4 matrix:
  S =
      s0   s1   s2   s3
      s4   s5   s6   s7
      s8   s9   s10  s11
      s12  s13  s14  s15
SLIDE 18
Pipeline optimization
SLIDE 19
Original permutation
The permutation can be visually represented as:
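[Diagram: in the column step, G() is applied to the four columns (s0, s4, s8, s12), (s1, s5, s9, s13), (s2, s6, s10, s14) and (s3, s7, s11, s15); in the diagonal step, to the four diagonals (s0, s5, s10, s15), (s1, s6, s11, s12), (s2, s7, s8, s13) and (s3, s4, s9, s14).]
Figure 5: Column (left) and diagonal (right) steps. Source: NORX v3.0 specification [AJN14].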
SLIDE 20 Original permutation
The NORX permutation is built from a function G(), applied to the columns and diagonals of S:
Algorithm 1 NORX F round function
function F
    input: S, G()                                ▷ NORX state s0 · · · s15 and G() function
    s0, s4, s8, s12 ← G(s0, s4, s8, s12)         ▷ Processing the columns
    s1, s5, s9, s13 ← G(s1, s5, s9, s13)
    s2, s6, s10, s14 ← G(s2, s6, s10, s14)
    s3, s7, s11, s15 ← G(s3, s7, s11, s15)
    s0, s5, s10, s15 ← G(s0, s5, s10, s15)       ▷ Processing the diagonals
    s1, s6, s11, s12 ← G(s1, s6, s11, s12)
    s2, s7, s8, s13 ← G(s2, s7, s8, s13)
    s3, s4, s9, s14 ← G(s3, s4, s9, s14)
end function
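The index pattern of Algorithm 1 follows from viewing S as a row-major 4 × 4 matrix. A minimal sketch, with hypothetical helper names `columns`, `diagonals` and `round_f`, and with G passed in as a parameter as in the algorithm's input:

```python
def columns():
    # Column j of the row-major 4x4 state holds words j, j+4, j+8, j+12.
    return [tuple(4 * row + col for row in range(4)) for col in range(4)]


def diagonals():
    # Diagonal j starts at word j in row 0 and moves one column to the right
    # (modulo 4) with every row.
    return [tuple(4 * row + (col + row) % 4 for row in range(4)) for col in range(4)]


def round_f(state, g):
    """One round F: apply g to the four columns, then to the four diagonals."""
    s = list(state)
    for i, j, k, m in columns() + diagonals():
        s[i], s[j], s[k], s[m] = g(s[i], s[j], s[k], s[m])
    return s
```

The four column calls touch disjoint words, and so do the four diagonal calls, which is what makes them independent of each other within a step.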
SLIDE 21 Original permutation
With G(a, b, c, d) being defined as:
Algorithm 2 NORX G permutation function
function G
    input: a, b, c, d                  ▷ Four words of the state
    a ← (a ⊕ b) ⊕ ((a ∧ b) ≪ 1)
    d ← (a ⊕ d) ≫ r0
    c ← (c ⊕ d) ⊕ ((c ∧ d) ≪ 1)
    b ← (c ⊕ b) ≫ r1
    a ← (a ⊕ b) ⊕ ((a ∧ b) ≪ 1)
    d ← (a ⊕ d) ≫ r2
    c ← (c ⊕ d) ⊕ ((c ∧ d) ≪ 1)
    b ← (c ⊕ b) ≫ r3
end function
Here ≪ is a left shift, ≫ is a right rotation, and r0, ..., r3 are fixed rotation offsets.
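Algorithm 2 translates almost line by line into code. The sketch below is a reference model rather than the optimized implementation; `make_g` is a hypothetical helper, and the rotation offsets in `ROTATIONS` are quoted from memory of the NORX specification, so they should be checked against it before any real use.

```python
# r0..r3 per word size; assumed values, to be verified against the NORX specification.
ROTATIONS = {32: (8, 11, 16, 31), 64: (8, 19, 40, 63)}


def make_g(w):
    """Build G for w-bit words (w = 32 or 64)."""
    mask = (1 << w) - 1
    r0, r1, r2, r3 = ROTATIONS[w]

    def rotr(x, r):
        # Rotate a w-bit word right by r bits.
        return ((x >> r) | (x << (w - r))) & mask

    def h(x, y):
        # (x XOR y) XOR ((x AND y) << 1), truncated to w bits.
        return ((x ^ y) ^ ((x & y) << 1)) & mask

    def g(a, b, c, d):
        a = h(a, b)
        d = rotr(a ^ d, r0)
        c = h(c, d)
        b = rotr(c ^ b, r1)
        a = h(a, b)
        d = rotr(a ^ d, r2)
        c = h(c, d)
        b = rotr(c ^ b, r3)
        return a, b, c, d

    return g
```

Every line of G depends on the result of the previous one, so a single call leaves little room for a pipelined processor to overlap instructions.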
How can we improve this permutation?
SLIDE 22
Code profiling
A synthetic test that encrypts random data was profiled to identify hotspots. The profile points to roundF as the best target for optimization.
Figure 6: Profiling results
SLIDE 23 Optimizing the F() function
The G() function can be split and reorganized in order to better use the processor’s pipeline:
[Diagram: four G2() calls, each covering two columns or two diagonals of the state.]
Figure 7: Column and diagonal steps with two-way pipeline optimization. Notice that each call to G2() operates on 8 words.
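The two-way split can be modelled by interleaving the steps of two independent G() calls. This is a sketch of the instruction ordering only: Python does not pipeline anything, the name `g2` is hypothetical, and the rotation offsets in `R32` are assumed NORX32 values. What the model can show is that the reordering leaves the result unchanged.

```python
MASK32 = 0xFFFFFFFF
R32 = (8, 11, 16, 31)  # assumed NORX32 rotation offsets r0..r3


def rotr32(x, r):
    return ((x >> r) | (x << (32 - r))) & MASK32


def h(x, y):
    return ((x ^ y) ^ ((x & y) << 1)) & MASK32


def g(a, b, c, d):
    # Plain G: each step waits for the previous one.
    a = h(a, b)
    d = rotr32(a ^ d, R32[0])
    c = h(c, d)
    b = rotr32(c ^ b, R32[1])
    a = h(a, b)
    d = rotr32(a ^ d, R32[2])
    c = h(c, d)
    b = rotr32(c ^ b, R32[3])
    return a, b, c, d


def g2(a0, b0, c0, d0, a1, b1, c1, d1):
    # Same eight steps, issued for both lanes before moving on, so that
    # neighbouring operations never depend on each other.
    a0 = h(a0, b0)
    a1 = h(a1, b1)
    d0 = rotr32(a0 ^ d0, R32[0])
    d1 = rotr32(a1 ^ d1, R32[0])
    c0 = h(c0, d0)
    c1 = h(c1, d1)
    b0 = rotr32(c0 ^ b0, R32[1])
    b1 = rotr32(c1 ^ b1, R32[1])
    a0 = h(a0, b0)
    a1 = h(a1, b1)
    d0 = rotr32(a0 ^ d0, R32[2])
    d1 = rotr32(a1 ^ d1, R32[2])
    c0 = h(c0, d0)
    c1 = h(c1, d1)
    b0 = rotr32(c0 ^ b0, R32[3])
    b1 = rotr32(c1 ^ b1, R32[3])
    return a0, b0, c0, d0, a1, b1, c1, d1
```

In a compiled implementation the same interleaving gives the processor two independent dependency chains to overlap; extending it to four lanes follows the same pattern.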
SLIDE 24
Optimizing the F() function
Or, going even further, with a single G4() function operating on the whole state at once:
[Diagram: two G4() calls, one for the column step and one for the diagonal step, each covering all 16 words of the state.]
Figure 8: Column and diagonal steps with four-way pipeline optimization, operating on the whole state at once.
SLIDE 25 Additional optimizations
A few extra steps were taken to improve code performance:
- Extensive use of preprocessor macros and code inlining.
- Avoiding extra or temporary variables by encrypting and decrypting in place.
- Initializing the sponge by loading constant values instead of evaluating F^2(0 ∥ 1 ∥ 2 ∥ · · · ∥ 15).
- Where possible, combining shift and rotate operations with arithmetic ones, so as to allow the use of ARM’s barrel shifter. For example, a = a + (b << 2) can compile into ADD r1, r1, r2, LSL #2. The parentheses matter: in C, + binds tighter than <<, so a = a + b << 2 means (a + b) << 2.
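The constant-loading optimization amounts to running F^2(0 ∥ 1 ∥ · · · ∥ 15) once, offline, and pasting the 16 resulting words into the source as literals. A minimal sketch for 32-bit words, assuming the rotation offsets below (recalled from the NORX specification, not taken from the slides) and using the hypothetical name `init_constants`:

```python
MASK32 = 0xFFFFFFFF
R32 = (8, 11, 16, 31)  # assumed NORX32 rotation offsets r0..r3

# Columns first, then diagonals, as in Algorithm 1.
QUADS = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
         (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]


def rotr32(x, r):
    return ((x >> r) | (x << (32 - r))) & MASK32


def h(x, y):
    return ((x ^ y) ^ ((x & y) << 1)) & MASK32


def g(a, b, c, d):
    a = h(a, b)
    d = rotr32(a ^ d, R32[0])
    c = h(c, d)
    b = rotr32(c ^ b, R32[1])
    a = h(a, b)
    d = rotr32(a ^ d, R32[2])
    c = h(c, d)
    b = rotr32(c ^ b, R32[3])
    return a, b, c, d


def round_f(state):
    s = list(state)
    for i, j, k, m in QUADS:
        s[i], s[j], s[k], s[m] = g(s[i], s[j], s[k], s[m])
    return s


def init_constants():
    # F applied twice to the state (0, 1, ..., 15).
    s = list(range(16))
    for _ in range(2):
        s = round_f(s)
    return s


if __name__ == "__main__":
    # Print the words in a form that can be pasted into a C source file.
    print(", ".join(f"0x{word:08X}" for word in init_constants()))
```

The printed values should be compared with the constants listed in the specification before being relied on.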
SLIDE 26
Results
SLIDE 27 Benchmark i
- Benchmarks were carried out on an Odroid XU4 device running Arch Linux for the Cortex-A7 and Cortex-A15.
- An Odroid-C2 device, also running Arch Linux, was used for tests with the 64-bit Cortex-A53.
- The code was compiled with GCC 6.3.1.
SLIDE 28 Benchmark ii
- Each test consists of the encryption of random data with lengths between 128 bytes and 1 megabyte, with a 128-bit key for NORX3261 and NORX3264, and a 256-bit key for NORX6461.
- Our tests were also carried out on the Cortex-M4, M3 and M0 for consistency.
- All measurements were taken using the processors’ cycle counters.
SLIDE 29 Results i
Table 1: Cycles per byte for NORX encryption on the 32-bit processors. Plaintext length of 256 KiB. Reference code from CAESAR [AJN15]. NORX3261 and NORX3264 use a 128-bit key, and NORX6461 uses a 256-bit key.

Variant    Core        Ref.    4x pipe  2x pipe  NEON    Speedup
NORX3261   Cortex-A7   29.45   29.70    24.72    26.99   16%
NORX3261   Cortex-A15  17.77   14.23    15.16    18.25   20%
NORX3264   Cortex-A7   28.46   33.74    26.50    33.27   7%
NORX3264   Cortex-A15  16.88   15.26    15.37    18.21   10%
NORX6461   Cortex-A7   48.52   50.09    46.65    17.81   4%
NORX6461   Cortex-A15  33.83   26.76    28.33    10.90   21%
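The Speedup column is consistent with the relative reduction in cycles per byte of the better pipelined variant over the reference code, which can be checked with a few lines. This reading assumes that the first numeric column of each row is the reference code, followed by the 4x and 2x pipelined versions; the function names are hypothetical.

```python
def cycles_per_byte(cycles: int, nbytes: int) -> float:
    # Cycle-counter difference divided by the number of bytes processed.
    return cycles / nbytes


def speedup_percent(ref_cpb: float, *optimized_cpb: float) -> int:
    # Relative reduction of the best optimized variant, as a whole percent.
    best = min(optimized_cpb)
    return round(100 * (ref_cpb - best) / ref_cpb)


# (reference, 4x pipe, 2x pipe) in cycles per byte, copied from Table 1.
TABLE1 = {
    ("NORX3261", "Cortex-A7"): (29.45, 29.70, 24.72),
    ("NORX3261", "Cortex-A15"): (17.77, 14.23, 15.16),
    ("NORX3264", "Cortex-A7"): (28.46, 33.74, 26.50),
    ("NORX3264", "Cortex-A15"): (16.88, 15.26, 15.37),
    ("NORX6461", "Cortex-A7"): (48.52, 50.09, 46.65),
    ("NORX6461", "Cortex-A15"): (33.83, 26.76, 28.33),
}

if __name__ == "__main__":
    for (variant, core), (ref, pipe4, pipe2) in TABLE1.items():
        print(variant, core, f"{speedup_percent(ref, pipe4, pipe2)}%")
```

For NORX3261 on the Cortex-A7, for example, (29.45 - 24.72) / 29.45 ≈ 0.16, which matches the 16% in the table.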
SLIDE 30 Results ii
Table 2: Cycles per byte for NORX encryption on the 64-bit platform. Plaintext length of 256 KiB. Reference code from CAESAR [AJN15]. NORX3261 and NORX3264 use a 128-bit key, and NORX6461 uses a 256-bit key.

Variant    Ref.    4x pipe  2x pipe  NEON    Speedup
NORX3261   19.55   10.94    12.27    10.81   44%
NORX3264   19.42   12.08    13.06    9.56    38%
NORX6461   10.29   5.84     6.58     24.40   43%
SLIDE 31 Results iii
Table 3: Performance of NORX3261 (cycles per byte) on the 32-bit Cortex-M architecture. Reference code from CAESAR [AJN15].

Cortex model  Size    Ref.    No pipeline  4x pipe  2x pipe
M0            8KiB    100.12  99.52        111.84   99.96
M3            32KiB   50.49   49.96        67.21    66.26
M4            16KiB   50.49   49.96        47.28    66.26
SLIDE 32
Result charts: NORX 3261
Figure 9: Chart showing the results of NORX3261 on Cortex A15
SLIDE 33
Result charts: NORX 3264
Figure 10: Chart showing the results of NORX3264 on Cortex A15
SLIDE 34
Result charts: NORX 6461
Figure 11: Chart showing the results of NORX6461 on Cortex A53
SLIDE 35 Conclusion: 32-bit NORX i
- For the 32-bit variant of NORX, the 4× pipeline implementation is 20% faster than the reference code on a 32-bit core, and 44% faster on a 64-bit core.
- Compared with a NEON SIMD implementation, the 2× pipeline is 12% faster on the Cortex-A7, and the 4× pipeline is 22% faster on the Cortex-A15.
SLIDE 36 Conclusion: 32-bit NORX ii
- While NORX has a SIMD-friendly construction, two extra transformations are needed in each application of F. Combined with the extra cost of transferring data from NEON registers back to ARM registers, this makes a pipelined implementation faster.
- Multisponge implementations of NORX (p = 4), running on a single core, show similar speedups.
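The two extra transformations are presumably the row rotations that line the diagonals up as columns before the diagonal step and undo them afterwards; this is an assumption, since the slides do not name them. The sketch below models a four-lane SIMD round in that style and checks it against the scalar round of Algorithm 1. All function names are hypothetical and the rotation offsets are assumed NORX32 values.

```python
MASK32 = 0xFFFFFFFF
R32 = (8, 11, 16, 31)  # assumed NORX32 rotation offsets r0..r3


def rotr32(x, r):
    return ((x >> r) | (x << (32 - r))) & MASK32


def h(x, y):
    return ((x ^ y) ^ ((x & y) << 1)) & MASK32


def g(a, b, c, d):
    a = h(a, b)
    d = rotr32(a ^ d, R32[0])
    c = h(c, d)
    b = rotr32(c ^ b, R32[1])
    a = h(a, b)
    d = rotr32(a ^ d, R32[2])
    c = h(c, d)
    b = rotr32(c ^ b, R32[3])
    return a, b, c, d


def g_lanes(A, B, C, D):
    # Lane-wise G over four rows, like one G on four-element vector registers.
    lanes = [g(a, b, c, d) for a, b, c, d in zip(A, B, C, D)]
    return tuple(list(row) for row in zip(*lanes))


def rotl_row(row, n):
    # Rotate the elements of a row left by n positions.
    return row[n:] + row[:n]


def round_f_rows(state):
    A, B, C, D = (list(state[i:i + 4]) for i in range(0, 16, 4))
    A, B, C, D = g_lanes(A, B, C, D)                          # column step
    B, C, D = rotl_row(B, 1), rotl_row(C, 2), rotl_row(D, 3)  # extra transformation 1
    A, B, C, D = g_lanes(A, B, C, D)                          # diagonal step
    B, C, D = rotl_row(B, 3), rotl_row(C, 2), rotl_row(D, 1)  # extra transformation 2
    return A + B + C + D


def round_f_scalar(state):
    # Scalar round as in Algorithm 1: four columns, then four diagonals.
    s = list(state)
    for i, j, k, m in [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
                       (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]:
        s[i], s[j], s[k], s[m] = g(s[i], s[j], s[k], s[m])
    return s
```

The scalar pipelined code needs neither rotation, because the diagonal step is expressed simply by choosing different registers as arguments.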
SLIDE 37 Conclusion: 64-bit NORX
- For the 64-bit variant of NORX, a 2× pipeline is better suited to the Cortex-A7, and the 4× pipeline to the Cortex-A15, mainly due to pipeline length.
- On 32-bit cores, the 64-bit NORX performs better using SIMD instructions: 64-bit rotations are very straightforward in NEON registers, compared to 32-bit registers.
- On the 64-bit processor, both pipelined implementations are faster than the reference code and the NEON code: 4× is 39% faster, and 2× is 31% faster.
SLIDE 38
Conclusion
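We presented an efficient software implementation of the NORX family of AEAD algorithms. In particular, the pipeline-oriented technique improved performance over the reference code on the target Cortex-A processors, with speedups of up to 21% on the 32-bit cores and up to 44% on the Cortex-A53.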
SLIDE 39
Future work
SLIDE 40 Future work: Blake2 hash function
The BLAKE2 hashing algorithm has a structure very similar to the NORX sponge state (a 4 × 4 word state), with a similar permutation. It is possible that applying the same techniques can yield good results.
G_blake:
    a ← a + b + m_{σ_r(2i)}
    d ← (d ⊕ a) ≫ 32
    c ← c + d
    b ← (b ⊕ c) ≫ 24
    a ← a + b + m_{σ_r(2i+1)}
    d ← (d ⊕ a) ≫ 16
    c ← c + d
    b ← (b ⊕ c) ≫ 63
SLIDE 41
Acknowledgements
This work was supported by LG Electronics via the “Crypto for IOT” project. The second author was partially supported by a research productivity grant from CNPq. We thank Prof. Diego Aranha for helping with the benchmark platforms.
SLIDE 42
References i
[AJN14] Jean-Philippe Aumasson, Philipp Jovanovic, and Samuel Neves. NORX: parallel and scalable AEAD. In ESORICS (2), volume 8713 of Lecture Notes in Computer Science, pages 19-36. Springer, 2014.
[AJN15] Jean-Philippe Aumasson, Philipp Jovanovic, and Samuel Neves. NORX reference implementations (software). https://github.com/norx/norx, 2015.
SLIDE 43
References ii
[BDPA11] Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles Van Assche. Duplexing the sponge: single-pass authenticated encryption and other applications. IACR Cryptology ePrint Archive, 2011:499, 2011.
[CAE13] Committee CAESAR. Competition for authenticated encryption: Security, applicability, and robustness. http://competitions.cr.yp.to, April 2013.
SLIDE 44
References iii
[Hol17] ARM Holdings. Processors cortex-a. http://www.arm.com/products/processors/cortex-a, March 2017.
[Rog02] Phillip Rogaway. Authenticated-encryption with associated-data. In ACM Conference on Computer and Communications Security, pages 98-107. ACM, 2002.
SLIDE 45
Thank you very much for your attention. Questions are welcome! luan@lasca.ic.unicamp.br