Intel's New AES Instructions: Enhanced Performance and Security - Shay Gueron - PowerPoint PPT Presentation



SLIDE 1

Intel’s New AES Instructions

Enhanced Performance and Security

Shay Gueron

  • Intel Corporation, Israel Development Center, Haifa, Israel
  • University of Haifa, Israel
SLIDE 2

Overview

  • AES basics
  • Performance hungry applications
  • The security issue
  • The AES instructions
  • Performance scalability
  • Basic usage
  • Software flexibility
  • Software tools
  • Performance and optimizations
  • More on software flexibility
  • And more…
SLIDE 3

AES Basics

SLIDE 4

AES Overview

[Diagram: Plain Text passes through ×10–14 "rounds" of ShiftRows, SubBytes (S-box), MixColumns, and AddRoundKey (with a round key) to produce the Cipher Text; the figure contrasts the fast and slow software encryption paths]

SLIDE 5

AES Transformations

  • AddRoundKey — 128-bit xor of the State and the round key
  • SubBytes — nonlinear bytewise substitution (repeated 16x)
  • ShiftRows — bytewise permutation
  • MixColumns — matrix multiplication in GF(2^8)
  • InvSubBytes, InvShiftRows, InvMixColumns — the inverse transformations
  • SubWord – 4 x SubBytes
  • RotWord – [a0, a1, a2, a3] → [a1, a2, a3, a0]
  • Rcon – in round i equals [{02}^(i-1), {00}, {00}, {00}]
SLIDE 6

AES Encryption

Tmp = AddRoundKey (Data, Round_Key_Encrypt [0])
For round = 1-9 or 1-11 or 1-13:
    Tmp = ShiftRows (Tmp)
    Tmp = SubBytes (Tmp)
    Tmp = MixColumns (Tmp)
    Tmp = AddRoundKey (Tmp, Round_Key_Encrypt [round])
end loop
Tmp = ShiftRows (Tmp)
Tmp = SubBytes (Tmp)
Tmp = AddRoundKey (Tmp, Round_Key_Encrypt [10 or 12 or 14])
Result = Tmp


40/48/56 steps

SLIDE 7

AES Decryption (Equivalent Inverse Cipher)

Tmp = AddRoundKey (Data, Round_Key_Decrypt [0])
For round = 1-9 or 1-11 or 1-13:
    Tmp = InvShiftRows (Tmp)
    Tmp = InvSubBytes (Tmp)
    Tmp = InvMixColumns (Tmp)
    Tmp = AddRoundKey (Tmp, Round_Key_Decrypt [round])
end loop
Tmp = InvShiftRows (Tmp)
Tmp = InvSubBytes (Tmp)
Tmp = AddRoundKey (Tmp, Round_Key_Decrypt [10 or 12 or 14])
Result = Tmp

SLIDE 8

AES-128 Key Expansion

AES-128 Key Expansion:
for (i = 0 .. 3)  { w[i] = Cipher Key[i] }
for (i = 4 .. 43) {
    temp = w[i-1]
    if (i mod 4 = 0) { temp = SubWord(RotWord(temp)) xor Rcon }
    w[i] = w[i-4] xor temp
}


AES-256 Key Expansion (Encrypt):
for (i = 0 .. 7)  { w[i] = Cipher Key[i] }
for (i = 8 .. 59) {
    temp = w[i-1]
    if (i mod 8 = 0)      { temp = SubWord(RotWord(temp)) xor Rcon }
    else if (i mod 8 = 4) { temp = SubWord(temp) }
    w[i] = w[i-8] xor temp
}

SLIDE 9

Preparing the decryption key schedule

[Diagram: the AES-128 key schedule words K0–K43, grouped four at a time into round keys Key0–Key10 (the Encrypt Round Keys)]

For the Equivalent Inverse Cipher: apply InvMixCols to the Encrypt Round Keys (Key1–Key9) to obtain the Decrypt Round Keys

SLIDE 10

Performance Hungry Applications

SLIDE 11

Performance hungry AES usage models

  • SSL/TLS for HTTPS
  • IPSec
  • OS Based Disk Encryption

– E.g., Microsoft Bitlocker
– Similar in Linux

  • File encryption utilities
  • Storage Encryption
  • Voice Over IP Security (VOIP)
Relevant to both client and server platforms
SLIDE 12

The Security Issue

SLIDE 13

CPU cache

Memory: a tradeoff between capacity and latency (and cost). Most instructions involve memory (loads and stores). Cache = a small, fast memory:

  • working close to the CPU's frequency
  • hiding the latency of the larger, slower memories
  • speculative: holds the "next" required data

Problem: in a multitasking environment, memory accesses can become implicitly data-dependent

SLIDE 14

Cache-based attacks (among others)

Theoretical attacks by Page:

  • Time-driven: execution time as a function of cache-hit/miss counts
    – 2003: Tsunoo et al. on DES
    – 2004: Bernstein on the first round of AES
    – 2006: Neve et al. on the first and second rounds of AES
  • Trace-driven: sequence of cache hits/misses
    – 2005: Bertoni et al. on AES through SimpleScalar
    – 2005: Lauradoux et al. on AES
    – 2006: Acıiçmez et al. on AES
  • Access-driven: cache-line accesses of the crypto process
    – 2005: Percival on RSA with multithreaded processors
    – 2005–06: Osvik, Shamir et al. on AES with multithreaded processors
    – 2005–06: Neve and Seifert on AES with single-threaded processors and a last-round attack
SLIDE 15

Table-based AES (e.g., OpenSSL)

Table-based → easier accesses and operations on 32-bit processors. For AES encryption, 5 precomputed tables map [1 byte] → [4 bytes], composed from two [1 byte] → [1 byte] tables S and S′:

T0 = [S′, S, S, S⊕S′]
T1 = [S⊕S′, S′, S, S]
T2 = [S, S⊕S′, S′, S]
T3 = [S, S, S⊕S′, S′]
T4 = [S, S, S, S]

/* round 1: */
t0 = T0[s0 >> 24] ^ T1[(s1 >> 16) & 0xff] ^ T2[(s2 >> 8) & 0xff] ^ T3[s3 & 0xff] ^ rcon[4];
t1 = T0[s1 >> 24] ^ T1[(s2 >> 16) & 0xff] ^ T2[(s3 >> 8) & 0xff] ^ T3[s0 & 0xff] ^ rcon[5];
t2 = T0[s2 >> 24] ^ T1[(s3 >> 16) & 0xff] ^ T2[(s0 >> 8) & 0xff] ^ T3[s1 & 0xff] ^ rcon[6];
t3 = T0[s3 >> 24] ^ T1[(s0 >> 16) & 0xff] ^ T2[(s1 >> 8) & 0xff] ^ T3[s2 & 0xff] ^ rcon[7];
/* round 2: */ …

SLIDE 16

Table based AES

T4 is used for the last round (no MixColumns) and for Key Expansion. T4 is the AES S-box (each value repeated 4x), indexed by [row = high nibble][column = low nibble, lsb → msb]:

     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
0:  63 7c 77 7b f2 6b 6f c5 30 01 67 2b fe d7 ab 76
1:  ca 82 c9 7d fa 59 47 f0 ad d4 a2 af 9c a4 72 c0
2:  b7 fd 93 26 36 3f f7 cc 34 a5 e5 f1 71 d8 31 15
3:  04 c7 23 c3 18 96 05 9a 07 12 80 e2 eb 27 b2 75
4:  09 83 2c 1a 1b 6e 5a a0 52 3b d6 b3 29 e3 2f 84
5:  53 d1 00 ed 20 fc b1 5b 6a cb be 39 4a 4c 58 cf
6:  d0 ef aa fb 43 4d 33 85 45 f9 02 7f 50 3c 9f a8
7:  51 a3 40 8f 92 9d 38 f5 bc b6 da 21 10 ff f3 d2
8:  cd 0c 13 ec 5f 97 44 17 c4 a7 7e 3d 64 5d 19 73
9:  60 81 4f dc 22 2a 90 88 46 ee b8 14 de 5e 0b db
a:  e0 32 3a 0a 49 06 24 5c c2 d3 ac 62 91 95 e4 79
b:  e7 c8 37 6d 8d d5 4e a9 6c 56 f4 ea 65 7a ae 08
c:  ba 78 25 2e 1c a6 b4 c6 e8 dd 74 1f 4b bd 8b 8a
d:  70 3e b5 66 48 03 f6 0e 61 35 57 b9 86 c1 1d 9e
e:  e1 f8 98 11 69 d9 8e 94 9b 1e 87 e9 ce 55 28 df
f:  8c a1 89 0d bf e6 42 68 41 99 2d 0f b0 54 bb 16

SLIDE 17

Exploiting OS scheduling

AES rounds are short compared to the context-switch frequency. Preemptive scheduling → a process can yield the CPU before the end of its OS quantum. Two processes:

  • spy continuously watches the cache accesses
  • crypto runs for small amounts of time

[Timeline: at the start of each OS quantum the spy (re)loads the table and waits; the crypto process accesses the tables until the end of the quantum]

SLIDE 18

Cache sharing leakages

Two processes on the same processor: crypto and spy

  • 1. spy loads a (large) table
  • 2. crypto runs on the processor
  • 3. spy reloads and times each table line: if the loading time is short → the line was not evicted; if long → the line was evicted

SLIDE 19

Mitigation

  • There are ways to write AES software that avoid the data-dependency of memory accesses

– But they severely degrade performance

SLIDE 20

Intel’s AES Instructions

SLIDE 21

AES New Instructions (AES-NI)

  • Will be introduced into the Intel instruction set starting from 2009

Four instructions to perform AES encryption and decryption

  • AESENC – Performs one round of AES encryption
  • AESENCLAST – Performs the last round of AES encryption
  • AESDEC – Performs one round of AES decryption
  • AESDECLAST – Performs the last round of AES decryption

Two instructions to support AES Key Expansion

  • AESKEYGENASSIST – Used for round key expansion
  • AESIMC – Converts encryption round keys to a form usable for decryption
  • Intel's architecture uses the Equivalent Inverse Cipher
SLIDE 22

AES Data Structure

The State (xmm0) in matrix representation:

S(0,0) S(0,1) S(0,2) S(0,3)
S(1,0) S(1,1) S(1,2) S(1,3)
S(2,0) S(2,1) S(2,2) S(2,3)
S(3,0) S(3,1) S(3,2) S(3,3)

The columns map to 32-bit words, least significant word first:

X0 = S(3,0) S(2,0) S(1,0) S(0,0)   (bits 0–31, lsb)
X1 = S(3,1) S(2,1) S(1,1) S(0,1)   (bits 32–63)
X2 = S(3,2) S(2,2) S(1,2) S(0,2)   (bits 64–95)
X3 = S(3,3) S(2,3) S(1,3) S(0,3)   (bits 96–127, msb)

The Round Key (words X4–X7) uses the same layout. State and Round Key are held in xmm0 and xmm2/m128.

SLIDE 23

The 4 AES Round Instructions

AESENC xmm0, xmm2/m128
    Tmp := xmm0; Round Key := xmm2/m128
    Tmp := ShiftRows (Tmp)
    Tmp := SubBytes (Tmp)
    Tmp := MixColumns (Tmp)
    xmm0 := Tmp xor Round Key

AESENCLAST xmm0, xmm2/m128
    Tmp := xmm0; Round Key := xmm2/m128
    Tmp := ShiftRows (Tmp)
    Tmp := SubBytes (Tmp)
    xmm0 := Tmp xor Round Key

AESDEC xmm0, xmm2/m128
    Tmp := xmm0; Round Key := xmm2/m128
    Tmp := Inverse ShiftRows (Tmp)
    Tmp := Inverse SubBytes (Tmp)
    Tmp := Inverse MixColumns (Tmp)
    xmm0 := Tmp xor Round Key

AESDECLAST xmm0, xmm2/m128
    State := xmm0; Round Key := xmm2/m128
    Tmp := Inverse ShiftRows (State)
    Tmp := Inverse SubBytes (Tmp)
    xmm0 := Tmp xor Round Key

SLIDE 24

Two instructions for Key Expansion

AESIMC xmm0, xmm2/m128
    RoundKey := xmm2/m128
    xmm0 := InvMixColumns (RoundKey)

AESKEYGENASSIST xmm0, xmm2/m128, imm8
    Tmp := xmm2/m128
    RCON[31:8] := 0; RCON[7:0] := imm8
    X3[31:0] := Tmp[127:96]
    X2[31:0] := Tmp[95:64]
    X1[31:0] := Tmp[63:32]
    X0[31:0] := Tmp[31:0]
    xmm0 := [RotWord (SubWord (X3)) XOR RCON, SubWord (X3),
             RotWord (SubWord (X1)) XOR RCON, SubWord (X1)]

SLIDE 25

AESKEYGENASSIST xmm0, xmm2/m128, imm8

[Dataflow diagram: words X1 and X3 of the input are extracted and duplicated; each copy passes through the S-box; one S-box output of each pair is rotated and XORed with RCON, forming the four output words]

SLIDE 26

Performance Scalability

SLIDE 27

Design for performance scalability

Tmp = AddRoundKey (Data, Round_Key_Encrypt [0])
For round = 1-9 or 1-11 or 1-13:
    Tmp = ShiftRows (Tmp)
    Tmp = SubBytes (Tmp)
    Tmp = MixColumns (Tmp)
    Tmp = AddRoundKey (Tmp, Round_Key_Encrypt [round])
end loop
Tmp = ShiftRows (Tmp)
Tmp = SubBytes (Tmp)
Tmp = AddRoundKey (Tmp, Round_Key_Encrypt [10 or 12 or 14])
Result = Tmp


Can control last round via immediate

SLIDE 28

Basic Usage

SLIDE 29

AES-128 Key Expansion

begin
    word temp
    for (i = 0 .. 3)  { w[i] = Initial Key[i] }
    for (i = 4 .. 43) {
        temp = w[i-1]
        if (i mod 4 = 0) { temp = SubWord(RotWord(temp)) xor Rcon }
        w[i] = w[i-4] xor temp
    }
end


AESKEYGENASSIST

SLIDE 30

AES-256 Key Expansion

word temp
for (i = 0 .. 7)  { w[i] = Initial Key[i] }
for (i = 8 .. 59) {
    temp = w[i-1]
    if (i mod 8 = 0)      { temp = SubWord(RotWord(temp)) xor Rcon }
    else if (i mod 8 = 4) { temp = SubWord(temp) }
    w[i] = w[i-8] xor temp
}

AESKEYGENASSIST

SLIDE 31

AESIMC xmm0, xmm2/m128

[Diagram: key schedule words K0–K43 grouped into round keys Key0–Key10 (the Encrypt Round Keys); AESIMC applies InvMixCols to Key1–Key9]

The Equivalent Inverse Cipher requires applying InvMixCols to the Encrypt Round Keys to obtain the Decrypt Round Keys

SLIDE 32

AES-128 Key Expansion

AESKEYGENASSIST xmm2, xmm0, 0x1
call key_expand_128
AESKEYGENASSIST xmm2, xmm0, 0x2
call key_expand_128
AESKEYGENASSIST xmm2, xmm0, 0x4
call key_expand_128
… …
AESKEYGENASSIST xmm2, xmm0, 0x36
call key_expand_128

key_expand_128:
    pshufd  xmm2, xmm2, 0xff
    vpslldq xmm3, xmm0, 0x4
    pxor    xmm0, xmm3
    vpslldq xmm3, xmm0, 0x4
    pxor    xmm0, xmm3
    vpslldq xmm3, xmm0, 0x4
    pxor    xmm0, xmm3
    pxor    xmm0, xmm2
    movdqu  XMMWORD PTR [rcx], xmm0
    add     rcx, 0x10
    ret

SLIDE 33

AES-192 Key Expansion

aeskeygenassist xmm2, xmm3, 0x1
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x2
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x4
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x8
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x10
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x20
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x40
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x80
call key_expansion_192

key_expansion_192:
    pshufd  xmm2, xmm2, 0x55
    vpslldq xmm4, xmm0, 0x4
    pxor    xmm0, xmm4
    pslldq  xmm4, 0x4
    pxor    xmm0, xmm4
    pslldq  xmm4, 0x4
    pxor    xmm0, xmm4
    pxor    xmm0, xmm2
    pshufd  xmm2, xmm0, 0xff
    vpslldq xmm4, xmm3, 0x4
    pxor    xmm3, xmm4
    pxor    xmm3, xmm2
    movdqu  XMMWORD PTR [rcx], xmm0
    add     rcx, 0x10
    movdqu  XMMWORD PTR [rcx], xmm3
    add     rcx, 0x8
    ret

SLIDE 34

AES-256 Key Expansion

aeskeygenassist xmm2, xmm3, 0x1
call key_expansion_256
aeskeygenassist xmm2, xmm3, 0x2
call key_expansion_256
aeskeygenassist xmm2, xmm3, 0x4
call key_expansion_256
aeskeygenassist xmm2, xmm3, 0x8
call key_expansion_256
aeskeygenassist xmm2, xmm3, 0x10
call key_expansion_256
aeskeygenassist xmm2, xmm3, 0x20
call key_expansion_256
aeskeygenassist xmm2, xmm3, 0x40
call key_expansion_256

key_expansion_256:
    pshufd  xmm2, xmm2, 0xff
    vpslldq xmm4, xmm0, 0x4
    pxor    xmm0, xmm4
    pslldq  xmm4, 0x4
    pxor    xmm0, xmm4
    pslldq  xmm4, 0x4
    pxor    xmm0, xmm4
    pxor    xmm0, xmm2
    movdqu  XMMWORD PTR [rcx], xmm0
    add     rcx, 0x10
    aeskeygenassist xmm4, xmm0, 0
    pshufd  xmm2, xmm4, 0xaa
    vpslldq xmm4, xmm3, 0x4
    pxor    xmm3, xmm4
    pslldq  xmm4, 0x4
    pxor    xmm3, xmm4
    pslldq  xmm4, 0x4

SLIDE 35

Encrypting with AES round instructions

AES-128 ECB mode example

Round keys already expanded:

for i from 1 to N_BLOCKS do
    xmm0 = BLOCK [i]                    // load next data block
    for j from 1 to 9 do
        xmm0 = AESENC (xmm0, RK [j])
    end
    xmm0 = AESENCLAST (xmm0, RK [10])
    store xmm0
end

SLIDE 36

AES-128 assembler (encryption and decryption)

AES-128 encryption:
    pxor   xmm0, xmm2
    AESENC xmm0, xmm3
    AESENC xmm0, xmm4
    AESENC xmm0, xmm5
    AESENC xmm0, xmm6
    AESENC xmm0, xmm7
    AESENC xmm0, xmm8
    AESENC xmm0, xmm9
    AESENC xmm0, xmm10
    AESENC xmm0, xmm11
    AESENCLAST xmm0, xmm12

Decryption round keys:
    AESIMC xmm3, xmm3
    AESIMC xmm4, xmm4
    AESIMC xmm5, xmm5
    AESIMC xmm6, xmm6
    AESIMC xmm7, xmm7
    AESIMC xmm8, xmm8
    AESIMC xmm9, xmm9
    AESIMC xmm10, xmm10
    AESIMC xmm11, xmm11

AES-128 decryption:
    pxor   xmm0, xmm12
    AESDEC xmm0, xmm11
    AESDEC xmm0, xmm10
    AESDEC xmm0, xmm9
    AESDEC xmm0, xmm8
    AESDEC xmm0, xmm7
    AESDEC xmm0, xmm6
    AESDEC xmm0, xmm5
    AESDEC xmm0, xmm4
    AESDEC xmm0, xmm3
    AESDECLAST xmm0, xmm2

SLIDE 37

Software Flexibility: modes of operation

SLIDE 38

ECB (Encrypt)

38

[Flow: get next plaintext block → AES encrypt → store result into memory as ciphertext block → more data? yes: repeat; no: done]

Use AES-NI building blocks

SLIDE 39

CBC (Encrypt)

39

[Flow: initialize feedback register with IV → get next plaintext block → XOR with feedback register → AES encrypt → store result into feedback register and into memory as ciphertext block → more data? yes: repeat; no: done]

Use AES-NI building blocks

SLIDE 40

CTR (Encrypt)

40

[Flow: initialize counter register with IV → AES-encrypt the counter → XOR with the next plaintext block → increment counter register → store result into memory as ciphertext block → more data? yes: repeat; no: done]

Use AES-NI building blocks

SLIDE 41

GCM

[Diagram: the AES-CTR computation encrypts data blocks 1, 2, 3, … into ciphertext blocks 1, 2, 3, …. In parallel, the Galois hash is computed: starting from hash 0, each ciphertext block is XORed into the running hash, which is then multiplied by the hash key in GF(2^128), producing hash 1, hash 2, etc.]

Use AES-NI building blocks

SLIDE 42

Software Tools

SLIDE 43

Software Development tools

[Flow: a C/C++ program is compiled with "icl /arch:AVX <filename>" into an executable binary; today the binary runs natively to produce the program output, while prior to silicon it runs under the Software Development Emulator: "sde -- <binary name>"]

SLIDE 44

Running the Basic Emulator

sde -- foo.exe <foo options>

For ease of use

  • Special command window where every command is run on the emulator

% sde -help
Usage: sde [args] -- application [application-args]

  -mix          (run mix histogram tool)
  -omix         (set the output file name for mix; implies -mix; default is "mix.out")
  -debugtrace   (run mix debugtrace tool)
  -odebugtrace  (set the output file name for debugtrace; implies -debugtrace; default is "debugtrace.out")
  -ast          (run the AVX/SSE transition checker)
  -oast         (set the output file name for the AVX/SSE transition checker; implies -ast; default is "avx-sse-transition.out")
  -no-avx       (disable AVX emulation, just emulate AES+PCLMULQDQ+SSE4)
  -no-aes       (disable AES+PCLMULQDQ+AVX emulation, just emulate SSE4)
  -pin-runtime  (use Pin's runtime libraries, required on some Linux* systems)
SLIDE 45

Compiler support for using AES-NI

Compiler support through

  • Inline asm
  • Intrinsics
(wmmintrin.h)

extern __m128i __cdecl _mm_aesdec_si128(__m128i v, __m128i rkey);
…
extern __m128i __cdecl _mm_clmulepi64_si128(__m128i v1, __m128i v2, const int imm8);

(User program)

#include <ia32intrin.h>
__m128i x, y, z;
z = _mm_aesdec_si128(x, y);

SLIDE 46

AES-128 CBC Encryption (Intrinsics)

void AES_128_CBC_Encrypt ()
{
    int i, j, k;
    __m128i tmp, feedback;
    __m128i RKEY [11];
    for (k = 0; k < 11; k++)
        RKEY [k] = _mm_load_si128 ((__m128i*)&Key_Schedule [4*k]);
    feedback = _mm_load_si128 ((__m128i*)&IV [0]);
    for (i = 0; i < NBLOCKS; i++) {
        tmp = _mm_load_si128 ((__m128i*)&PLAINTEXT [i*4]);
        tmp = _mm_xor_si128 (tmp, feedback);
        tmp = _mm_xor_si128 (tmp, RKEY [0]);
        for (j = 1; j < 10; j++) {
            tmp = _mm_aesenc_si128 (tmp, RKEY [j]);
        }
        tmp = _mm_aesenclast_si128 (tmp, RKEY [10]);
        feedback = tmp;
        _mm_store_si128 ((__m128i*)&CIPHERTEXT [4*i], tmp);
    }
}

SLIDE 47

Performance and optimizations

SLIDE 48

Parallelization

  • All useful modes (except CBC encrypt) are parallelizable
  • Blocks can be processed independently

– Can apply the loop-reversal technique

  • The only serial mode in common use is CBC encrypt

– The leading usage model is Bitlocker (and disk encryption in general)
– CBC-encrypt throughput is less of a concern because:

  • Disk write latency is acceptable (CBC encrypt)
  • Disk read is the sensitive path, and CBC decrypt is parallel
  • Efficient software techniques can help squeeze a performance boost out of the existing architecture
  • The latency of the AES-NI instructions does not matter too much

– As long as #registers ≥ latency of the instruction
SLIDE 49

Straightforward AES

for i from 1 to N_BLOCKS do
    xmm0 = BLOCK [i]                  // load
    xmm0 = AESENC (xmm0, RK [1])
    xmm0 = AESENC (xmm0, RK [2])
    xmm0 = AESENC (xmm0, RK [3])
    …
    xmm0 = AESENC (xmm0, RK [9])
    xmm0 = AESENCLAST (xmm0, RK [10])
    store xmm0
end


Wait L cycles

Performance: (10 x Latency) cycles / 16B

SLIDE 50

Efficient Usage of AES-NI: Loop Reversal

for i from 0 to N_BLOCKS/8 - 1 do
    xmm0 = BLOCK [8*i+1]; xmm2 = BLOCK [8*i+2]; … xmm8 = BLOCK [8*i+8]
    xmm0 = AESENC (xmm0, RK [1])
    xmm2 = AESENC (xmm2, RK [1])
    xmm3 = AESENC (xmm3, RK [1])
    …
    xmm8 = AESENC (xmm8, RK [1])
    xmm0 = AESENC (xmm0, RK [2])
    xmm2 = AESENC (xmm2, RK [2])
    …
    xmm8 = AESENC (xmm8, RK [2])
    …
    xmm0 = AESENCLAST (xmm0, RK [10])
    xmm2 = AESENCLAST (xmm2, RK [10])
    …
    xmm8 = AESENCLAST (xmm8, RK [10])
    store xmm0; store xmm2; … store xmm8
end


By the time an instruction's result is needed again, L cycles have elapsed and it is ready – no need to wait.

Scheduling the flow so that dependent AES-NI instructions are spaced by more than L cycles effectively dispatches an AES-NI instruction every cycle.

Throughput: 80 cycles / (8×16B) – a speedup factor of L.

Parallel modes of operation and a fully pipelined hardware implementation of AES-NI allow re-scheduling the flow so that dependent AES-NI instructions are spaced far enough apart to hide the latency of one instruction.

SLIDE 51

Parallelizing CBC encryption

void AES_128_CBC_Encrypt_Parallel_4_Blocks ()
{
    int i, j, k;
    __m128i feedback1, feedback2, feedback3, feedback4;
    __m128i tmp1, tmp2, tmp3, tmp4;
    __m128i RKEY [11];
    for (k = 0; k < 11; k++) {
        RKEY [k] = _mm_load_si128 ((__m128i*)&Key_Sched [4*k]);
    }
    feedback1 = _mm_load_si128 ((__m128i*)&IV1 [0]);
    feedback2 = _mm_load_si128 ((__m128i*)&IV2 [0]);
    feedback3 = _mm_load_si128 ((__m128i*)&IV3 [0]);
    feedback4 = _mm_load_si128 ((__m128i*)&IV4 [0]);
    for (i = 0; i < NBLOCKS; i++) {
        tmp1 = _mm_load_si128 ((__m128i*)&PLAINTEXT1 [i*4]);
        tmp2 = _mm_load_si128 ((__m128i*)&PLAINTEXT2 [i*4]);
        tmp3 = _mm_load_si128 ((__m128i*)&PLAINTEXT3 [i*4]);
        tmp4 = _mm_load_si128 ((__m128i*)&PLAINTEXT4 [i*4]);
        tmp1 = _mm_xor_si128 (tmp1, feedback1);
        tmp2 = _mm_xor_si128 (tmp2, feedback2);
        tmp3 = _mm_xor_si128 (tmp3, feedback3);
        tmp4 = _mm_xor_si128 (tmp4, feedback4);
        tmp1 = _mm_xor_si128 (tmp1, RKEY [0]);
        tmp2 = _mm_xor_si128 (tmp2, RKEY [0]);
        tmp3 = _mm_xor_si128 (tmp3, RKEY [0]);
        tmp4 = _mm_xor_si128 (tmp4, RKEY [0]);
        for (j = 1; j < 10; j++) {
            tmp1 = _mm_aesenc_si128 (tmp1, RKEY [j]);
            tmp2 = _mm_aesenc_si128 (tmp2, RKEY [j]);
            tmp3 = _mm_aesenc_si128 (tmp3, RKEY [j]);
            tmp4 = _mm_aesenc_si128 (tmp4, RKEY [j]);
        }
        tmp1 = _mm_aesenclast_si128 (tmp1, RKEY [10]);
        tmp2 = _mm_aesenclast_si128 (tmp2, RKEY [10]);
        tmp3 = _mm_aesenclast_si128 (tmp3, RKEY [10]);
        tmp4 = _mm_aesenclast_si128 (tmp4, RKEY [10]);
        feedback1 = tmp1; feedback2 = tmp2;
        feedback3 = tmp3; feedback4 = tmp4;
        _mm_store_si128 ((__m128i*)&CIPHERTEXT1 [4*i], tmp1);
        _mm_store_si128 ((__m128i*)&CIPHERTEXT2 [4*i], tmp2);
        _mm_store_si128 ((__m128i*)&CIPHERTEXT3 [4*i], tmp3);
        _mm_store_si128 ((__m128i*)&CIPHERTEXT4 [4*i], tmp4);
    }
}

Parallelization at a higher level: operate on multiple independent data streams in parallel.

SLIDE 52

Performance projections

  • Highly optimized software implementations of AES:

– ~15 cycles/byte on today's silicon (OpenSSL)
– 18 cycles/byte from MSFT on a 2006 platform

  • No side-channel mitigation included
  • Mitigation is costly (no known real "protected implementation")
  • With AES-NI:

– Side-channel mitigation is built in
– Significant speedup:

  • 2–3x for CBC encrypt in serial mode
  • More than 10x in parallel modes of operation
SLIDE 53

More on Software Flexibility

SLIDE 54

Rijndael-256 (256b block size)

VPBLENDVB xmm3, xmm2, xmm0, xmm5
VPBLENDVB xmm4, xmm0, xmm2, xmm5
PSHUFB xmm3, xmm8
PSHUFB xmm4, xmm8
AESENC xmm0, xmm6
AESENC xmm2, xmm7

Operands: the "left" half of the RIJNDAEL input state (columns 0–3), the "right" half of the input state (columns 4–7), and the "right" and "left" halves of the RIJNDAEL round key.

Shuffle mask: 0x03020d0c0f0e09080b0a050407060100 (accounts for ShiftRows)

VPBLENDVB: the blend mask selects bytes 1–3, 6–7, 10–11, 15 from the 1st operand and all other bytes from the 2nd operand

SLIDE 55

Isolating the AES Transformations

  • The AES-NI instructions perform bundled sequences of AES transformations

– But each one of these transformations can be isolated by a proper combination of the instructions and byte shuffling (the PSHUFB instruction)

  • Motivation

– Constructing cipher variants
– Supporting possible future modifications of the AES standard
– Using the AES primitives as building blocks for ciphers and for cryptographic hash functions

  • Hashing: some of the Secure Hash Function submissions to NIST's SHA-3 competition use AES rounds and/or AES transformations

– E.g., LANE, SHAMATA, SHAvite-3, and Vortex

SLIDE 56

Isolating the AES Transformations

Isolating ShiftRows:
    PSHUFB xmm0, 0x0b06010c07020d08030e09040f0a0500
Isolating InvShiftRows:
    PSHUFB xmm0, 0x0306090c0f0205080b0e0104070a0d00
Isolating MixColumns:
    AESDECLAST xmm0, 0x00000000000000000000000000000000
    AESENC     xmm0, 0x00000000000000000000000000000000
Isolating InvMixColumns:
    AESENCLAST xmm0, 0x00000000000000000000000000000000
    AESDEC     xmm0, 0x00000000000000000000000000000000
Isolating SubBytes:
    PSHUFB     xmm0, 0x0306090c0f0205080b0e0104070a0d00
    AESENCLAST xmm0, 0x00000000000000000000000000000000
Isolating InvSubBytes:
    PSHUFB     xmm0, 0x0b06010c07020d08030e09040f0a0500
    AESDECLAST xmm0, 0x00000000000000000000000000000000


Why this works, e.g.:

AESDECLAST xmm0, 0
    Tmp := Inverse Shift Rows (State)
    Tmp := Inverse Substitute Bytes (Tmp)
    xmm0 := Tmp xor 0

AESENC xmm0, 0
    Round Key := 0
    Tmp := Shift Rows (Tmp)
    Tmp := Substitute Bytes (Tmp)
    Tmp := Mix Columns (Tmp)
    xmm0 := Tmp xor 0

SLIDE 57

AES-NI and PCLMULQDQ: Latency and throughput

  • Micro-architectural enhancements: latency and throughput improve across CPU generations
  • (Latency and throughput are measured in cycles)
  • AESENC/AESENCLAST, AESDEC/AESDECLAST – Latency/Throughput:

WSM: 7/2; SNB: 8/1; HSW: 7/1; BDW: 7/1; SKL: 4/1

  • PCLMULQDQ – Latency/Throughput:

SNB: 14/8; HSW: 7/2; BDW: 7/1; SKL: 4/1


Architecture codenames: Westmere (WSM), Sandy Bridge (SNB), Haswell (HSW), Broadwell (BDW), Skylake (SKL)

SLIDE 58

Backup

SLIDE 59

References

  • S. Gueron. Intel Advanced Encryption Standard (AES) Instructions Set, Rev 3.01. Intel Software Network. https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
  • S. Gueron. Intel's New AES Instructions for Enhanced Performance and Security. Fast Software Encryption, 16th International Workshop (FSE 2009), Lecture Notes in Computer Science 5665, pp. 51–66 (2009).