Intel’s New AES Instructions
Enhanced Performance and Security
Shay Gueron
- Intel Corporation, Israel Development Center, Haifa, Israel
- University of Haifa, Israel
Overview
AES Overview
[Diagram: AES round structure — Plain Text → Add Round Key → "Rounds" ×10-14 of ShiftRows, SubBytes (S-box), MixColumns, Add Round Key (round key) → Cipher Text. The diagram contrasts the transformations that are fast in software with those (SubBytes, MixColumns) that make software encryption slow.]
AES Transformations
AES Encryption
Tmp = AddRoundKey (Data, Round_Key_Encrypt[0])
For round = 1-9 or 1-11 or 1-13:
    Tmp = ShiftRows (Tmp)
    Tmp = SubBytes (Tmp)
    Tmp = MixColumns (Tmp)
    Tmp = AddRoundKey (Tmp, Round_Key_Encrypt[round])
end loop
Tmp = ShiftRows (Tmp)
Tmp = SubBytes (Tmp)
Tmp = AddRoundKey (Tmp, Round_Key_Encrypt[10 or 12 or 14])
Result = Tmp
40/48/56 steps
AES Decryption (Equivalent Inverse Cipher)
Tmp = AddRoundKey (Data, Round_Key_Decrypt[0])
For round = 1-9 or 1-11 or 1-13:
    Tmp = InvShiftRows (Tmp)
    Tmp = InvSubBytes (Tmp)
    Tmp = InvMixColumns (Tmp)
    Tmp = AddRoundKey (Tmp, Round_Key_Decrypt[round])
end loop
Tmp = InvShiftRows (Tmp)
Tmp = InvSubBytes (Tmp)
Tmp = AddRoundKey (Tmp, Round_Key_Decrypt[10 or 12 or 14])
Result = Tmp
AES-128 Key Expansion
for (i = 0 .. 3) { w[i] = Cipher Key[i] }
for (i = 4 .. 43) {
    temp = w[i-1]
    if (i mod 4 = 0) { temp = SubWord(RotWord(temp)) xor Rcon }
    w[i] = w[i-4] xor temp
}
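The key-expansion pseudocode above can be sketched in plain C. This is an illustrative sketch, not code from the deck: the function and table names are made up, and the "Rcon" that the pseudocode abbreviates is spelled out as the per-round constant Rcon[i/4] placed in the word's top byte. The S-box values are the standard FIPS-197 table (shown later in this deck as T4).

```c
#include <stdint.h>

/* Standard AES S-box (FIPS-197). */
static const uint8_t sbox[256] = {
0x63,0x7c,0x77,0x7b,0xf2,0x6b,0x6f,0xc5,0x30,0x01,0x67,0x2b,0xfe,0xd7,0xab,0x76,
0xca,0x82,0xc9,0x7d,0xfa,0x59,0x47,0xf0,0xad,0xd4,0xa2,0xaf,0x9c,0xa4,0x72,0xc0,
0xb7,0xfd,0x93,0x26,0x36,0x3f,0xf7,0xcc,0x34,0xa5,0xe5,0xf1,0x71,0xd8,0x31,0x15,
0x04,0xc7,0x23,0xc3,0x18,0x96,0x05,0x9a,0x07,0x12,0x80,0xe2,0xeb,0x27,0xb2,0x75,
0x09,0x83,0x2c,0x1a,0x1b,0x6e,0x5a,0xa0,0x52,0x3b,0xd6,0xb3,0x29,0xe3,0x2f,0x84,
0x53,0xd1,0x00,0xed,0x20,0xfc,0xb1,0x5b,0x6a,0xcb,0xbe,0x39,0x4a,0x4c,0x58,0xcf,
0xd0,0xef,0xaa,0xfb,0x43,0x4d,0x33,0x85,0x45,0xf9,0x02,0x7f,0x50,0x3c,0x9f,0xa8,
0x51,0xa3,0x40,0x8f,0x92,0x9d,0x38,0xf5,0xbc,0xb6,0xda,0x21,0x10,0xff,0xf3,0xd2,
0xcd,0x0c,0x13,0xec,0x5f,0x97,0x44,0x17,0xc4,0xa7,0x7e,0x3d,0x64,0x5d,0x19,0x73,
0x60,0x81,0x4f,0xdc,0x22,0x2a,0x90,0x88,0x46,0xee,0xb8,0x14,0xde,0x5e,0x0b,0xdb,
0xe0,0x32,0x3a,0x0a,0x49,0x06,0x24,0x5c,0xc2,0xd3,0xac,0x62,0x91,0x95,0xe4,0x79,
0xe7,0xc8,0x37,0x6d,0x8d,0xd5,0x4e,0xa9,0x6c,0x56,0xf4,0xea,0x65,0x7a,0xae,0x08,
0xba,0x78,0x25,0x2e,0x1c,0xa6,0xb4,0xc6,0xe8,0xdd,0x74,0x1f,0x4b,0xbd,0x8b,0x8a,
0x70,0x3e,0xb5,0x66,0x48,0x03,0xf6,0x0e,0x61,0x35,0x57,0xb9,0x86,0xc1,0x1d,0x9e,
0xe1,0xf8,0x98,0x11,0x69,0xd9,0x8e,0x94,0x9b,0x1e,0x87,0xe9,0xce,0x55,0x28,0xdf,
0x8c,0xa1,0x89,0x0d,0xbf,0xe6,0x42,0x68,0x41,0x99,0x2d,0x0f,0xb0,0x54,0xbb,0x16 };

/* SubWord: apply the S-box to each byte of a 32-bit word. */
static uint32_t sub_word(uint32_t w)
{
    return ((uint32_t)sbox[(w >> 24) & 0xff] << 24) |
           ((uint32_t)sbox[(w >> 16) & 0xff] << 16) |
           ((uint32_t)sbox[(w >>  8) & 0xff] <<  8) |
            (uint32_t)sbox[ w        & 0xff];
}

/* RotWord: rotate the bytes of a word left by one position. */
static uint32_t rot_word(uint32_t w)
{
    return (w << 8) | (w >> 24);
}

/* Expand a 16-byte cipher key into the 44 round-key words w[0..43]. */
void aes128_key_expand(const uint8_t key[16], uint32_t w[44])
{
    /* Round constants Rcon[1..10], placed in the word's top byte. */
    static const uint8_t rcon[11] = { 0x00, 0x01, 0x02, 0x04, 0x08,
                                      0x10, 0x20, 0x40, 0x80, 0x1b, 0x36 };
    int i;
    for (i = 0; i < 4; i++)
        w[i] = ((uint32_t)key[4*i] << 24) | ((uint32_t)key[4*i+1] << 16) |
               ((uint32_t)key[4*i+2] << 8) | key[4*i+3];
    for (i = 4; i < 44; i++) {
        uint32_t temp = w[i-1];
        if (i % 4 == 0)
            temp = sub_word(rot_word(temp)) ^ ((uint32_t)rcon[i/4] << 24);
        w[i] = w[i-4] ^ temp;
    }
}
```

With the FIPS-197 Appendix A.1 key (2b7e1516 28aed2a6 abf71588 09cf4f3c), the first expanded word is w[4] = a0fafe17 and the last is w[43] = b6630ca6.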
AES-256 Key Expansion
for (i = 0 .. 7) { w[i] = Cipher Key[i] }
for (i = 8 .. 59) {
    temp = w[i-1]
    if (i mod 8 = 0) { temp = SubWord(RotWord(temp)) xor Rcon }
    else if (i mod 8 = 4) { temp = SubWord(temp) }
    w[i] = w[i-8] xor temp
}
Preparing the decryption key schedule
[Diagram: the encryption round keys Key0..Key10 (words K0..K43), with InvMixColumns applied to each of Key1..Key9.]
For the Equivalent Inverse Cipher: apply InvMixColumns to the Encrypt round keys to obtain the Decrypt round keys.
Performance-hungry AES usage models
– E.g., Microsoft BitLocker; similar disk encryption in Linux
– Relevant to client and server platforms
CPU cache
Memory involves a tradeoff between capacity, latency, and cost. Most instructions reference memory (loads and stores). A cache is a small, fast memory.
Problem: in a multitasking environment, memory access timing can become implicitly data-dependent.
Cache-based attacks (among others)
Theoretical attacks by Page:
– 2003: Tsunoo et al. on DES
– 2004: Bernstein on first round of AES
– 2006: Neve et al. on first and second round of AES
– 2005: Bertoni et al. on AES through SimpleScalar
– 2005: Lauradoux et al. on AES
– 2006: Acıiçmez et al. on AES
– 2005: Percival on RSA with multithreaded processors
– 2005-06: Osvik, Shamir et al. on AES with multithreaded processors
– 2005-06: Neve and Seifert on AES with single-threaded processors and last round attack
Table-based AES (e.g., OpenSSL)
Table-based implementations give easier accesses and operations on 32-bit processors.
For AES encryption, 5 precomputed tables, indexed by 1 byte with 4-byte entries, composed from two 1-byte tables S and S':
T0 = [S', S, S, SS']
T1 = [SS', S', S, S]
T2 = [S, SS', S', S]
T3 = [S, S, SS', S']
T4 = [S, S, S, S]
/* round 1: */
t0 = T0[s0 >> 24] ^ T1[(s1 >> 16) & 0xff] ^ T2[(s2 >> 8) & 0xff] ^ T3[s3 & 0xff] ^ rk[4];
t1 = T0[s1 >> 24] ^ T1[(s2 >> 16) & 0xff] ^ T2[(s3 >> 8) & 0xff] ^ T3[s0 & 0xff] ^ rk[5];
t2 = T0[s2 >> 24] ^ T1[(s3 >> 16) & 0xff] ^ T2[(s0 >> 8) & 0xff] ^ T3[s1 & 0xff] ^ rk[6];
t3 = T0[s3 >> 24] ^ T1[(s0 >> 16) & 0xff] ^ T2[(s1 >> 8) & 0xff] ^ T3[s2 & 0xff] ^ rk[7];
/* round 2: */ …
Table-based AES
T4 is used for the last round (no MixColumns) and for Key Expansion. T4 =
      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f   (lsb of index)
0x   63 7c 77 7b f2 6b 6f c5 30 01 67 2b fe d7 ab 76
1x   ca 82 c9 7d fa 59 47 f0 ad d4 a2 af 9c a4 72 c0
2x   b7 fd 93 26 36 3f f7 cc 34 a5 e5 f1 71 d8 31 15
3x   04 c7 23 c3 18 96 05 9a 07 12 80 e2 eb 27 b2 75
4x   09 83 2c 1a 1b 6e 5a a0 52 3b d6 b3 29 e3 2f 84
5x   53 d1 00 ed 20 fc b1 5b 6a cb be 39 4a 4c 58 cf
6x   d0 ef aa fb 43 4d 33 85 45 f9 02 7f 50 3c 9f a8
7x   51 a3 40 8f 92 9d 38 f5 bc b6 da 21 10 ff f3 d2
8x   cd 0c 13 ec 5f 97 44 17 c4 a7 7e 3d 64 5d 19 73
9x   60 81 4f dc 22 2a 90 88 46 ee b8 14 de 5e 0b db
ax   e0 32 3a 0a 49 06 24 5c c2 d3 ac 62 91 95 e4 79
bx   e7 c8 37 6d 8d d5 4e a9 6c 56 f4 ea 65 7a ae 08
cx   ba 78 25 2e 1c a6 b4 c6 e8 dd 74 1f 4b bd 8b 8a
dx   70 3e b5 66 48 03 f6 0e 61 35 57 b9 86 c1 1d 9e
ex   e1 f8 98 11 69 d9 8e 94 9b 1e 87 e9 ce 55 28 df
fx   8c a1 89 0d bf e6 42 68 41 99 2d 0f b0 54 bb 16
(each value repeated 4x in T4)
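The composition of the T tables from S and S' can be sketched in C: with S'[x] = 2·S[x] in GF(2^8), the entry T0[x] = [S', S, S, SS'] packs 2·S[x], S[x], S[x], 3·S[x] from msb to lsb. This is an illustrative sketch: the names build_te0 and xtime are made up; the S-box is the standard table above.

```c
#include <stdint.h>

/* Standard AES S-box (FIPS-197), as tabulated above. */
static const uint8_t sbox[256] = {
0x63,0x7c,0x77,0x7b,0xf2,0x6b,0x6f,0xc5,0x30,0x01,0x67,0x2b,0xfe,0xd7,0xab,0x76,
0xca,0x82,0xc9,0x7d,0xfa,0x59,0x47,0xf0,0xad,0xd4,0xa2,0xaf,0x9c,0xa4,0x72,0xc0,
0xb7,0xfd,0x93,0x26,0x36,0x3f,0xf7,0xcc,0x34,0xa5,0xe5,0xf1,0x71,0xd8,0x31,0x15,
0x04,0xc7,0x23,0xc3,0x18,0x96,0x05,0x9a,0x07,0x12,0x80,0xe2,0xeb,0x27,0xb2,0x75,
0x09,0x83,0x2c,0x1a,0x1b,0x6e,0x5a,0xa0,0x52,0x3b,0xd6,0xb3,0x29,0xe3,0x2f,0x84,
0x53,0xd1,0x00,0xed,0x20,0xfc,0xb1,0x5b,0x6a,0xcb,0xbe,0x39,0x4a,0x4c,0x58,0xcf,
0xd0,0xef,0xaa,0xfb,0x43,0x4d,0x33,0x85,0x45,0xf9,0x02,0x7f,0x50,0x3c,0x9f,0xa8,
0x51,0xa3,0x40,0x8f,0x92,0x9d,0x38,0xf5,0xbc,0xb6,0xda,0x21,0x10,0xff,0xf3,0xd2,
0xcd,0x0c,0x13,0xec,0x5f,0x97,0x44,0x17,0xc4,0xa7,0x7e,0x3d,0x64,0x5d,0x19,0x73,
0x60,0x81,0x4f,0xdc,0x22,0x2a,0x90,0x88,0x46,0xee,0xb8,0x14,0xde,0x5e,0x0b,0xdb,
0xe0,0x32,0x3a,0x0a,0x49,0x06,0x24,0x5c,0xc2,0xd3,0xac,0x62,0x91,0x95,0xe4,0x79,
0xe7,0xc8,0x37,0x6d,0x8d,0xd5,0x4e,0xa9,0x6c,0x56,0xf4,0xea,0x65,0x7a,0xae,0x08,
0xba,0x78,0x25,0x2e,0x1c,0xa6,0xb4,0xc6,0xe8,0xdd,0x74,0x1f,0x4b,0xbd,0x8b,0x8a,
0x70,0x3e,0xb5,0x66,0x48,0x03,0xf6,0x0e,0x61,0x35,0x57,0xb9,0x86,0xc1,0x1d,0x9e,
0xe1,0xf8,0x98,0x11,0x69,0xd9,0x8e,0x94,0x9b,0x1e,0x87,0xe9,0xce,0x55,0x28,0xdf,
0x8c,0xa1,0x89,0x0d,0xbf,0xe6,0x42,0x68,0x41,0x99,0x2d,0x0f,0xb0,0x54,0xbb,0x16 };

/* Multiply by x (i.e., by 2) in GF(2^8) modulo x^8+x^4+x^3+x+1. */
static uint8_t xtime(uint8_t b)
{
    return (uint8_t)((b << 1) ^ ((b & 0x80) ? 0x1b : 0x00));
}

/* Build T0 = [S', S, S, SS'], stored msb..lsb in each 32-bit entry. */
void build_te0(uint32_t te0[256])
{
    for (int x = 0; x < 256; x++) {
        uint8_t s  = sbox[x];
        uint8_t s2 = xtime(s);          /* S'[x]  = 2*S[x] */
        uint8_t s3 = (uint8_t)(s2 ^ s); /* SS'[x] = S ^ S' = 3*S[x] */
        te0[x] = ((uint32_t)s2 << 24) | ((uint32_t)s << 16) |
                 ((uint32_t)s << 8)  | s3;
    }
}
```

The other round tables are byte rotations of T0 (e.g., T1[x] is T0[x] rotated right by 8 bits), which is why some 32-bit implementations store only one table.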
Exploiting OS scheduling
AES rounds are short compared to the frequency of context switches.
Preemptive scheduling: a process can yield the CPU before the end of its OS quantum.
[Diagram: two processes within one OS quantum — one (re)loads the tables and waits; the other accesses the tables.]
Cache sharing leakages
Two processes on the same processor: crypto and spy
For each table cache line: a short loading time means the line was not evicted; a long loading time means it was evicted.
[Diagram: cache lines probed by the spy.]
Mitigation
– Software mitigations exist, but they severely degrade performance
AES New Instructions (AES-NI)
Four instructions to perform AES encryption and decryption
Two instructions to perform AES Key Expansion
AES Data Structure
The state bytes S(row, column) in matrix order:

S(0,0) S(0,1) S(0,2) S(0,3)
S(1,0) S(1,1) S(1,2) S(1,3)
S(2,0) S(2,1) S(2,2) S(2,3)
S(3,0) S(3,1) S(3,2) S(3,3)

are packed column by column into 32-bit words:

X0 = S(3,0) S(2,0) S(1,0) S(0,0)
X1 = S(3,1) S(2,1) S(1,1) S(0,1)
X2 = S(3,2) S(2,2) S(1,2) S(0,2)
X3 = S(3,3) S(2,3) S(1,3) S(0,3)

xmm1: X0 in bits 0-31 (lsb), X1 in bits 32-63, X2 in bits 64-95, X3 in bits 96-127 (msb)
xmm2/m128: X4 in bits 0-31, X5 in bits 32-63, X6 in bits 64-95, X7 in bits 96-127
The State (xmm0) in matrix representation
State and Round Key in xmm0 and xmm2/m128
The 4 AES Round Instructions
AESENC xmm0, xmm2/m128
Tmp := xmm0; Round Key := xmm2/m128
Tmp := Shift Rows (Tmp)
Tmp := Substitute Bytes (Tmp)
Tmp := Mix Columns (Tmp)
xmm0 := Tmp xor Round Key

AESENCLAST xmm0, xmm2/m128
Tmp := xmm0; Round Key := xmm2/m128
Tmp := Shift Rows (Tmp)
Tmp := Substitute Bytes (Tmp)
xmm0 := Tmp xor Round Key
AESDEC xmm0, xmm2/m128
Tmp := xmm0; Round Key := xmm2/m128
Tmp := Inverse Shift Rows (Tmp)
Tmp := Inverse Substitute Bytes (Tmp)
Tmp := Inverse Mix Columns (Tmp)
xmm0 := Tmp xor Round Key
AESDECLAST xmm0, xmm2/m128
State := xmm0; Round Key := xmm2/m128
Tmp := Inverse Shift Rows (State)
Tmp := Inverse Substitute Bytes (Tmp)
xmm0 := Tmp xor Round Key
Two instructions for Key Expansion
AESIMC xmm0, xmm2/m128
RoundKey := xmm2/m128
xmm0 := InvMixColumns (RoundKey)

AESKEYGENASSIST xmm0, xmm2/m128, imm8
Tmp := xmm2/m128
RCON[31-8] := 0; RCON[7-0] := imm8
X3[31-0] := Tmp[127-96]; X2[31-0] := Tmp[95-64]
X1[31-0] := Tmp[63-32];  X0[31-0] := Tmp[31-0]
xmm0 := [RotWord (SubWord (X3)) XOR RCON, SubWord (X3),
         RotWord (SubWord (X1)) XOR RCON, SubWord (X1)]
AESKEYGENASSIST xmm0, xmm2/m128, imm8
[Diagram: dataflow of AESKEYGENASSIST — X3 and X1 are each duplicated and passed through the S-box (SubWord); one copy of each is then rotated (RotWord) and XORed with RCON.]
Design for performance scalability
Tmp = AddRoundKey (Data, Round_Key_Encrypt[0])
For round = 1-9 or 1-11 or 1-13:
    Tmp = ShiftRows (Tmp)
    Tmp = SubBytes (Tmp)
    Tmp = MixColumns (Tmp)
    Tmp = AddRoundKey (Tmp, Round_Key_Encrypt[round])
end loop
Tmp = ShiftRows (Tmp)
Tmp = SubBytes (Tmp)
Tmp = AddRoundKey (Tmp, Round_Key_Encrypt[10 or 12 or 14])
Result = Tmp

Can control last round via immediate
AES-128 Key Expansion
begin
    word temp
    for (i = 0 .. 3) { w[i] = Initial Key[i] }
    for (i = 4 .. 43) {
        temp = w[i-1]
        if (i mod 4 = 0) { temp = SubWord(RotWord(temp)) xor Rcon }
        w[i] = w[i-4] xor temp
    }
end
The SubWord(RotWord(temp)) xor Rcon step maps to AESKEYGENASSIST.
AES-256 Key Expansion
word temp
for (i = 0 .. 7) { w[i] = Initial Key[i] }
for (i = 8 .. 59) {
    temp = w[i-1]
    if (i mod 8 = 0) { temp = SubWord(RotWord(temp)) xor Rcon }
    else if (i mod 8 = 4) { temp = SubWord(temp) }
    w[i] = w[i-8] xor temp
}
Both SubWord steps map to AESKEYGENASSIST.
AESIMC xmm0, xmm2/m128
[Diagram: the encryption round keys Key0..Key10 (words K0..K43), with InvMixColumns applied to each of Key1..Key9.]
The Equivalent Inverse Cipher requires applying InvMixColumns to the Encrypt round keys to obtain the Decrypt round keys.
AES-128 Key Expansion
AESKEYGENASSIST xmm2, xmm0, 0x1
call key_expand_128
AESKEYGENASSIST xmm2, xmm0, 0x2
call key_expand_128
AESKEYGENASSIST xmm2, xmm0, 0x4
call key_expand_128
… …
AESKEYGENASSIST xmm2, xmm0, 0x36
call key_expand_128
key_expand_128:
    pshufd xmm2, xmm2, 0xff
    vpslldq xmm3, xmm0, 0x4
    pxor xmm0, xmm3
    vpslldq xmm3, xmm0, 0x4
    pxor xmm0, xmm3
    vpslldq xmm3, xmm0, 0x4
    pxor xmm0, xmm3
    pxor xmm0, xmm2
    movdqu XMMWORD PTR [rcx], xmm0
    add rcx, 0x10
    ret
AES-192 Key Expansion
aeskeygenassist xmm2, xmm3, 0x1
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x2
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x4
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x8
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x10
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x20
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x40
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x80
call key_expansion_192

key_expansion_192:
    pshufd xmm2, xmm2, 0x55
    vpslldq xmm4, xmm0, 0x4
    pxor xmm0, xmm4
    pslldq xmm4, 0x4
    pxor xmm0, xmm4
    pslldq xmm4, 0x4
    pxor xmm0, xmm4
    pxor xmm0, xmm2
    pshufd xmm2, xmm0, 0xff
    vpslldq xmm4, xmm3, 0x4
    pxor xmm3, xmm4
    pxor xmm3, xmm2
    movdqu XMMWORD PTR [rcx], xmm0
    add rcx, 0x10
    movdqu XMMWORD PTR [rcx], xmm3
    add rcx, 0x8
    ret
AES-256 Key Expansion
aeskeygenassist xmm2, xmm3, 0x1
call key_expansion_256
aeskeygenassist xmm2, xmm3, 0x2
call key_expansion_256
aeskeygenassist xmm2, xmm3, 0x4
call key_expansion_256
aeskeygenassist xmm2, xmm3, 0x8
call key_expansion_256
aeskeygenassist xmm2, xmm3, 0x10
call key_expansion_256
aeskeygenassist xmm2, xmm3, 0x20
call key_expansion_256
aeskeygenassist xmm2, xmm3, 0x40
call key_expansion_256

key_expansion_256:
    pshufd xmm2, xmm2, 0xff
    vpslldq xmm4, xmm0, 0x4
    pxor xmm0, xmm4
    pslldq xmm4, 0x4
    pxor xmm0, xmm4
    pslldq xmm4, 0x4
    pxor xmm0, xmm4
    pxor xmm0, xmm2
    movdqu XMMWORD PTR [rcx], xmm0
    add rcx, 0x10
    aeskeygenassist xmm4, xmm0, 0
    pshufd xmm2, xmm4, 0xaa
    vpslldq xmm4, xmm3, 0x4
    pxor xmm3, xmm4
    pslldq xmm4, 0x4
    pxor xmm3, xmm4
    pslldq xmm4, 0x4
    …
Encrypting with AES round instructions
AES-128 ECB mode example
Round keys already expanded.

for i from 1 to N_BLOCKS do
    xmm0 = BLOCK[i]    // load next data block to process
    for j from 1 to 9 do
        xmm0 = AESENC (xmm0, RK[j])
    end
    xmm0 = AESENCLAST (xmm0, RK[10])
    store xmm0
end
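The loop above can be sketched with the compiler intrinsics described later in the deck. This is a hedged sketch, not code from the deck: the function names are illustrative, the GCC/Clang target("aes") attribute stands in for compiling with -maes, running it assumes a CPU with AES-NI, and the key-expansion helper mirrors the pshufd/pslldq/pxor sequence of the key_expand_128 routine shown on the key-expansion slide.

```c
#include <stdint.h>
#include <immintrin.h>   /* AES-NI intrinsics */

/* One key-expansion step: mirrors key_expand_128 (pshufd/pslldq/pxor). */
__attribute__((target("aes")))
static __m128i expand_step(__m128i key, __m128i keygened)
{
    keygened = _mm_shuffle_epi32(keygened, 0xff);      /* pshufd 0xff */
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));  /* pslldq + pxor */
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    return _mm_xor_si128(key, keygened);
}

/* imm8 of AESKEYGENASSIST must be a compile-time constant, hence a macro. */
#define EXPAND(k, rcon) expand_step(k, _mm_aeskeygenassist_si128(k, rcon))

/* Encrypt one 16-byte block in ECB mode with a 128-bit key. */
__attribute__((target("aes")))
void aes128_ecb_encrypt_block(const uint8_t key[16],
                              const uint8_t in[16], uint8_t out[16])
{
    __m128i rk[11];
    rk[0]  = _mm_loadu_si128((const __m128i *)key);
    rk[1]  = EXPAND(rk[0], 0x01);
    rk[2]  = EXPAND(rk[1], 0x02);
    rk[3]  = EXPAND(rk[2], 0x04);
    rk[4]  = EXPAND(rk[3], 0x08);
    rk[5]  = EXPAND(rk[4], 0x10);
    rk[6]  = EXPAND(rk[5], 0x20);
    rk[7]  = EXPAND(rk[6], 0x40);
    rk[8]  = EXPAND(rk[7], 0x80);
    rk[9]  = EXPAND(rk[8], 0x1b);
    rk[10] = EXPAND(rk[9], 0x36);

    __m128i m = _mm_loadu_si128((const __m128i *)in);
    m = _mm_xor_si128(m, rk[0]);            /* initial AddRoundKey */
    for (int j = 1; j < 10; j++)
        m = _mm_aesenc_si128(m, rk[j]);     /* AESENC, rounds 1..9 */
    m = _mm_aesenclast_si128(m, rk[10]);    /* AESENCLAST, round 10 */
    _mm_storeu_si128((__m128i *)out, m);
}
```

With the FIPS-197 Appendix C.1 key 000102…0f and plaintext 00112233445566778899aabbccddeeff, this produces the ciphertext 69c4e0d86a7b0430d8cdb78070b4c55a.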
AES-128 assembler (encryption and decryption)
AES-128 decryption:
    pxor xmm0, xmm12
    AESDEC xmm0, xmm11
    AESDEC xmm0, xmm10
    AESDEC xmm0, xmm9
    AESDEC xmm0, xmm8
    AESDEC xmm0, xmm7
    AESDEC xmm0, xmm6
    AESDEC xmm0, xmm5
    AESDEC xmm0, xmm4
    AESDEC xmm0, xmm3
    AESDECLAST xmm0, xmm2

Decryption round keys:
    AESIMC xmm3, xmm3
    AESIMC xmm4, xmm4
    AESIMC xmm5, xmm5
    AESIMC xmm6, xmm6
    AESIMC xmm7, xmm7
    AESIMC xmm8, xmm8
    AESIMC xmm9, xmm9
    AESIMC xmm10, xmm10
    AESIMC xmm11, xmm11

AES-128 encryption:
    pxor xmm0, xmm2
    AESENC xmm0, xmm3
    AESENC xmm0, xmm4
    AESENC xmm0, xmm5
    AESENC xmm0, xmm6
    AESENC xmm0, xmm7
    AESENC xmm0, xmm8
    AESENC xmm0, xmm9
    AESENC xmm0, xmm10
    AESENC xmm0, xmm11
    AESENCLAST xmm0, xmm12
Software Flexibility: modes of operation
ECB (Encrypt)
[Flow (ECB encrypt): get next plaintext block → AES encrypt → store result into memory as ciphertext block → more data? YES: repeat; NO: done.]
Use AES-NI building blocks
CBC (Encrypt)
[Flow (CBC encrypt): initialize feedback register with IV → get next plaintext block → XOR with feedback register → AES encrypt → store result into feedback register and into memory as ciphertext block → more data? YES: repeat; NO: done.]
Use AES-NI building blocks
CTR (Encrypt)
[Flow (CTR encrypt): initialize counter register with IV → AES encrypt the counter register → XOR with next plaintext block → increment counter register → store result into memory as ciphertext block → more data? YES: repeat; NO: done.]
Use AES-NI building blocks
GCM
[Diagram: GCM — an AES CTR computation encrypts data blocks 1, 2, 3, … into ciphertext blocks; each ciphertext block is XORed into a running hash value, which is then multiplied by the hash key in GF(2^128) (the Galois hash).]
Use AES-NI building blocks
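The GF(2^128) multiplication in the Galois hash is built on carry-less multiplication, exposed as the PCLMULQDQ instruction (the _mm_clmulepi64_si128 intrinsic listed later in this deck). A minimal sketch of the primitive itself, not of full GHASH: clmul_lo is an illustrative wrapper name, and running it assumes a CPU with PCLMULQDQ (the GCC/Clang target("pclmul") attribute stands in for -mpclmul).

```c
#include <stdint.h>
#include <immintrin.h>   /* _mm_clmulepi64_si128 */

/* Carry-less multiply of two 64-bit polynomials over GF(2);
   returns the low 64 bits of the 128-bit product. */
__attribute__((target("pclmul")))
uint64_t clmul_lo(uint64_t a, uint64_t b)
{
    __m128i x = _mm_set_epi64x(0, (long long)a);
    __m128i y = _mm_set_epi64x(0, (long long)b);
    /* imm8 = 0x00: multiply the low qwords of both operands. */
    __m128i p = _mm_clmulepi64_si128(x, y, 0x00);
    return (uint64_t)_mm_cvtsi128_si64(p);
}
```

Unlike integer multiplication there are no carries: (x+1)·(x+1) = x^2+1, so clmul_lo(3, 3) is 5 rather than 9.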
Software Development tools
[Flow: C/C++ program → compile with "icl /arch:AVX <filename>" → executable binary. Prior to silicon: run "sde -- <binary name>" to obtain the program output; today: run the binary directly.]
Software Development Emulator
Running the Basic Emulator
sde -- foo.exe <foo options>
For ease of use
% sde -help
Usage: sde [args] -- application [application-args]
… (default is "mix.out")
… (default is "debugtrace.out")
… (default is "avx-sse-transition.out")
Compiler support for using AES-NI
Compiler support through
Declared in wmmintrin.h:
extern __m128i __cdecl _mm_aesdec_si128(__m128i v, __m128i rkey);
…
extern __m128i __cdecl _mm_clmulepi64_si128(__m128i v1, __m128i v2, const int imm8);

User program:
#include <ia32intrin.h>
__m128i x, y, z;
z = _mm_aesdec_si128(x, y);
AES-128 CBC Encryption (Intrinsics)
void AES_128_CBC_Encrypt ()
{
    int i, j, k;
    __m128i tmp, feedback;
    __m128i RKEY [11];

    for (k = 0; k < 11; k++)
        RKEY [k] = _mm_load_si128 ((__m128i*)&Key_Schedule [4*k]);

    feedback = _mm_load_si128 ((__m128i*)&IV [0]);
    for (i = 0; i < NBLOCKS; i++) {
        tmp = _mm_load_si128 ((__m128i*)&PLAINTEXT [i*4]);
        tmp = _mm_xor_si128 (tmp, feedback);
        tmp = _mm_xor_si128 (tmp, RKEY [0]);
        for (j = 1; j < 10; j++)
            tmp = _mm_aesenc_si128 (tmp, RKEY [j]);
        tmp = _mm_aesenclast_si128 (tmp, RKEY [10]);
        feedback = tmp;
        _mm_store_si128 ((__m128i*)&CIPHERTEXT [4*i], tmp);
    }
}
Parallelization
– Can apply the loop reversal technique
– Leading usage model is BitLocker (and disk encryption in general)
– CBC-encrypt throughput is less sensitive in the existing architecture
– As long as #registers ≥ latency of the instruction
Straightforward AES
for i from 1 to N_BLOCKS do
    xmm0 = BLOCK[i]    // load
    xmm0 = AESENC (xmm0, RK[1])
    xmm0 = AESENC (xmm0, RK[2])
    xmm0 = AESENC (xmm0, RK[3])
    …
    xmm0 = AESENC (xmm0, RK[9])
    xmm0 = AESENCLAST (xmm0, RK[10])
    store xmm0
end
Each AESENC must wait L cycles for the previous one to complete.
Performance: (10 × Latency) cycles / 16 bytes
Efficient Usage of AES-NI: Loop Reversal
for i from 0 to N_BLOCKS/8 - 1 do
    xmm0 = BLOCK[8*i+1]; xmm2 = BLOCK[8*i+2]; … xmm8 = BLOCK[8*i+8]
    xmm0 = AESENC (xmm0, RK[1])
    xmm2 = AESENC (xmm2, RK[1])
    xmm3 = AESENC (xmm3, RK[1])
    …
    xmm8 = AESENC (xmm8, RK[1])
    xmm0 = AESENC (xmm0, RK[2])
    xmm2 = AESENC (xmm2, RK[2])
    …
    xmm8 = AESENC (xmm8, RK[2])
    …
    xmm0 = AESENCLAST (xmm0, RK[10])
    xmm2 = AESENCLAST (xmm2, RK[10])
    …
    xmm8 = AESENCLAST (xmm8, RK[10])
    store xmm0; store xmm2; … store xmm8
end
By the time xmm0 is needed again, L cycles have elapsed and its result is ready: no need to wait.
Scheduling the flow so that dependent AES-NI instructions are spaced by more than L cycles; effectively, an AES-NI instruction is dispatched every cycle.
Throughput: 80 cycles / (8 × 16 bytes). Gain: a speedup factor of L.
Parallel modes of operation
The pipelined hardware implementation of AES-NI allows re-scheduling the flow so that dependent AES-NI instructions are spaced to hide the latency of one instruction.
Parallelizing CBC encryption
void AES_128_CBC_Encrypt_Parallel_4_Blocks ()
{
    int i, j, k;
    __m128i feedback1, feedback2, feedback3, feedback4;
    __m128i tmp1, tmp2, tmp3, tmp4;
    __m128i RKEY [11];

    for (k = 0; k < 11; k++)
        RKEY [k] = _mm_load_si128 ((__m128i*)&Key_Sched [4*k]);

    feedback1 = _mm_load_si128 ((__m128i*)&IV1 [0]);
    feedback2 = _mm_load_si128 ((__m128i*)&IV2 [0]);
    feedback3 = _mm_load_si128 ((__m128i*)&IV3 [0]);
    feedback4 = _mm_load_si128 ((__m128i*)&IV4 [0]);

    for (i = 0; i < NBLOCKS; i++) {
        tmp1 = _mm_load_si128 ((__m128i*)&PLAINTEXT1 [i*4]);
        tmp2 = _mm_load_si128 ((__m128i*)&PLAINTEXT2 [i*4]);
        tmp3 = _mm_load_si128 ((__m128i*)&PLAINTEXT3 [i*4]);
        tmp4 = _mm_load_si128 ((__m128i*)&PLAINTEXT4 [i*4]);

        tmp1 = _mm_xor_si128 (tmp1, feedback1);
        tmp2 = _mm_xor_si128 (tmp2, feedback2);
        tmp3 = _mm_xor_si128 (tmp3, feedback3);
        tmp4 = _mm_xor_si128 (tmp4, feedback4);

        tmp1 = _mm_xor_si128 (tmp1, RKEY [0]);
        tmp2 = _mm_xor_si128 (tmp2, RKEY [0]);
        tmp3 = _mm_xor_si128 (tmp3, RKEY [0]);
        tmp4 = _mm_xor_si128 (tmp4, RKEY [0]);

        for (j = 1; j < 10; j++) {
            tmp1 = _mm_aesenc_si128 (tmp1, RKEY [j]);
            tmp2 = _mm_aesenc_si128 (tmp2, RKEY [j]);
            tmp3 = _mm_aesenc_si128 (tmp3, RKEY [j]);
            tmp4 = _mm_aesenc_si128 (tmp4, RKEY [j]);
        }

        tmp1 = _mm_aesenclast_si128 (tmp1, RKEY [10]);
        tmp2 = _mm_aesenclast_si128 (tmp2, RKEY [10]);
        tmp3 = _mm_aesenclast_si128 (tmp3, RKEY [10]);
        tmp4 = _mm_aesenclast_si128 (tmp4, RKEY [10]);

        feedback1 = tmp1; feedback2 = tmp2;
        feedback3 = tmp3; feedback4 = tmp4;

        _mm_store_si128 ((__m128i*)&CIPHERTEXT1 [4*i], tmp1);
        _mm_store_si128 ((__m128i*)&CIPHERTEXT2 [4*i], tmp2);
        _mm_store_si128 ((__m128i*)&CIPHERTEXT3 [4*i], tmp3);
        _mm_store_si128 ((__m128i*)&CIPHERTEXT4 [4*i], tmp4);
    }
}

Parallelization at a higher level: process multiple independent data streams in parallel.
Performance projections
– On today’s silicon: ~15 cycles/byte (OpenSSL)
– 18 cycles/byte from MSFT on a 2006 platform
– Side channel mitigation is built-in
Significant speedup
More on Software Flexibility
Rijndael-256 (256-bit block size)
VPBLENDVB xmm3, xmm2, xmm0, xmm5
VPBLENDVB xmm4, xmm0, xmm2, xmm5
PSHUFB xmm3, xmm8
PSHUFB xmm4, xmm8
AESENC xmm0, xmm6
AESENC xmm2, xmm7
Operands: the "left" half of the Rijndael input state (columns 0-3), the "right" half of the input state (columns 4-7), and the "right" and "left" halves of the Rijndael round key.
Mask: 0x03020d0c0f0e09080b0a050407060100 (accounts for ShiftRows)
VPBLENDVB mask: selects bytes 1-3, 6-7, 10-11, 15 from the 1st operand and all other bytes from the 2nd operand.
Isolating the AES Transformations
– But each of these transformations can be isolated by a proper combination of the AES instructions and the byte-shuffling instruction (PSHUFB).
– Constructing cipher variants
– Supporting possible future modifications in the AES standard
– Using the AES primitives as building blocks for ciphers and for cryptographic hash functions
Several candidates in NIST’s SHA-3 competition use AES rounds and/or AES transformations.
– E.g., LANE, SHAMATA, SHAvite-3, and Vortex
Isolating the AES Transformations
Isolating ShiftRows:
    PSHUFB xmm0, 0x0b06010c07020d08030e09040f0a0500
Isolating InvShiftRows:
    PSHUFB xmm0, 0x0306090c0f0205080b0e0104070a0d00
Isolating MixColumns:
    AESDECLAST xmm0, 0x00000000000000000000000000000000
    AESENC xmm0, 0x00000000000000000000000000000000
Isolating InvMixColumns:
    AESENCLAST xmm0, 0x00000000000000000000000000000000
    AESDEC xmm0, 0x00000000000000000000000000000000
Isolating SubBytes:
    PSHUFB xmm0, 0x0306090c0f0205080b0e0104070a0d00
    AESENCLAST xmm0, 0x00000000000000000000000000000000
Isolating InvSubBytes:
    PSHUFB xmm0, 0x0b06010c07020d08030e09040f0a0500
    AESDECLAST xmm0, 0x00000000000000000000000000000000
AESDECLAST xmm0, 0:
    Tmp := Inverse Shift Rows (State)
    Tmp := Inverse Substitute Bytes (Tmp)
    xmm0 := Tmp xor 0
AESENC xmm0, 0:
    Round Key := 0
    Tmp := Shift Rows (xmm0)
    Tmp := Substitute Bytes (Tmp)
    Tmp := Mix Columns (Tmp)
    xmm0 := Tmp xor 0
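The cancellation argument can be checked directly against AESIMC: applying the "Isolating InvMixColumns" pair (AESENCLAST with a zero key, then AESDEC with a zero key) to a state must equal AESIMC of that state, since the ShiftRows/InvShiftRows and SubBytes/InvSubBytes steps cancel. A sketch in intrinsics — the function names are illustrative, and running it assumes a CPU with AES-NI (the GCC/Clang target("aes") attribute stands in for -maes):

```c
#include <stdint.h>
#include <immintrin.h>

/* InvMixColumns isolated via AESENCLAST(0) followed by AESDEC(0):
   AESENCLAST: SubBytes(ShiftRows(x)) xor 0
   AESDEC:     InvMixColumns(InvSubBytes(InvShiftRows(y))) xor 0
   The byte permutations and S-box lookups cancel, leaving InvMixColumns. */
__attribute__((target("aes")))
void isolate_invmixcolumns(const uint8_t in[16], uint8_t out[16])
{
    __m128i zero = _mm_setzero_si128();
    __m128i x = _mm_loadu_si128((const __m128i *)in);
    x = _mm_aesenclast_si128(x, zero);
    x = _mm_aesdec_si128(x, zero);
    _mm_storeu_si128((__m128i *)out, x);
}

/* Reference: the dedicated AESIMC instruction. */
__attribute__((target("aes")))
void aesimc_ref(const uint8_t in[16], uint8_t out[16])
{
    __m128i x = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_aesimc_si128(x));
}
```

The two routines agree on every input, which is exactly the identity the slide's derivation establishes.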
AES-NI and PCLMULQDQ: Latency and throughput
AES-NI:    HSW: 7/1; BDW: 7/1; SKL: 4/1
PCLMULQDQ: HSW: 7/2; BDW: 7/1; SKL: 4/1
Architecture codenames: Westmere (WSM), Sandy Bridge (SNB), Haswell (HSW), Broadwell (BDW), Skylake (SKL)