

SLIDE 1

CryptoManiac

Slides borrowed with permission from Todd Austin and Lisa Wu

University of Michigan Advanced Computer Architecture Laboratory

Architecture's Diminishing Return

• Staples of value we strive for…
  • High Speed
  • Low Power
  • Low Cost
• Tricks of the trade
  • Faster clock rates, via pipelining
  • Higher instruction throughput, via ILP extraction
• Strong evidence of diminishing return, PIII vs. P4
  • 22% less P4 inst throughput (0.35 vs. 0.45 SPECInt/MHz)
• Less return ⇒ less value ⇒
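
The 22% figure can be checked from the two per-MHz throughput numbers on the slide. This is only a quick sanity check, and it assumes 0.35 SPECInt/MHz is the P4 value and 0.45 SPECInt/MHz is the PIII value:

```latex
\[
\frac{0.45 - 0.35}{0.45} = \frac{0.10}{0.45} \approx 0.22
\]
```

Under that reading, the P4 delivers roughly 22% less SPECInt per MHz than the PIII.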

SLIDE 2

A Powerful Solution: Eschew Generality

• Specialization limits the scope of a device's operation
  • Produces stronger properties and invariants
  • Results in higher return optimizations
  • Programmability preserves the flexibility afforded by GPPs
• A natural fit for embedded designs
  • Where application domains are more likely restrictive
  • Where cost and power are 1st order concerns

[Figure: design spectrum with two axes, Speed/Efficiency and Flexibility/Programmability; labeled design points: H/W designs, Application Specific Processor, General Purpose Processors + ISA Extensions, General Purpose Processors]

Cryptography

• Definitions:
  • encryption vs. decryption
  • public-key cipher vs. secret-key cipher
• Public-secret key ciphers are the most commonly used

[Figure: two cipher diagrams. Public-key: plaintext → f(x) with Public Key → ciphertext → g(x) with Private Key → plaintext. Secret-key: plaintext → g(x) with Private Key → ciphertext → g(x) with Private Key → plaintext.]

SLIDE 3

SSL Session Breakdown Focus: Secret-Key Ciphers

[Figure: SSL session timeline between client and server; labels: https get, authenticate, private key, ..., https recv, close; phases marked public and private]

SSL Characterization by Session Length

[Chart: relative contribution to run time (0% to 100%) vs. SSL session length in bytes (1k, 2k, 4k, 8k, 16k, 32k); series: Public, Other, Private; annotation: average size of a single web object (21k)]

Benchmark Suite

Cipher    Key Size  Blk Size  Rnds/Blk  Author        Application
3DES      112       64        48        CryptSoft     SSL, SSH
Blowfish  128       64        16        CryptSoft     Norton Utilities
IDEA      128       64        8         Ascom         PGP, SSH
Mars      128       128       16        IBM           AES Candidate
RC4       128       8         1         CryptSoft     SSL
RC6       128       128       18        RSA Security  AES Candidate
Rijndael  128       128       10        Rijmen        AES Standard
Twofish   128       128       16        Counterpane   AES Candidate

SLIDE 4

Cipher Throughput Analysis

[Chart: cipher throughput (axis 0 to 350) for Blowfish, 3DES, IDEA, Mars, RC4, RC6, Rijndael, Twofish; series: Alpha 21264, 4W, DF]

• Alpha 21264 vs. 4W
  • All except Mars and Twofish were within 10% of the actual machine tests
  • Mars 11%, Twofish 15%
• Alpha 21264 vs. DF
  • Blowfish, IDEA, and RC6 are running within 20% of DF performance
  • Mars 29%, Twofish 76%
  • RC4 and Rijndael are outliers

Characteristics of Cipher Kernels

• Diffusion (goal of cryptography)
  • Goal is to randomly impress upon each group of output bits some information from each of the input bits
  • Process needs to be reversible
  • Should result in a random perturbation of each output bit with a probability > 50%
• Cipher kernel loops run about 16 times on each block of data, mixing the data more and more each round
• Cipher kernels have little to no parallelism
  • Usually a very long recurrence
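
One way to make the diffusion goal concrete is to flip a single input bit and count how many output bits change after a number of rounds. The sketch below is illustrative only: `toy_round` is a hypothetical reversible mixing step (an xorshift, an odd multiply and a constant add), not taken from any cipher in the benchmark suite, and `hamming32` and `avalanche_fraction` are names introduced here.

```c
#include <stdint.h>

/* Count the bits that differ between two 32-bit words. */
static unsigned hamming32(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Hypothetical toy round, NOT a real cipher: each step is invertible
 * modulo 2^32, so the whole round is reversible. */
static uint32_t toy_round(uint32_t x)
{
    x ^= x >> 15;          /* xorshift: invertible */
    x *= 0x2c1b3c6du;      /* multiply by an odd constant: invertible */
    x += 0x9e3779b9u;      /* modular addition of a constant */
    return x;
}

/* Fraction of the 32 output bits that change when input bit `bit`
 * is flipped, after `rounds` applications of f. */
static double avalanche_fraction(uint32_t (*f)(uint32_t), uint32_t x,
                                 unsigned bit, unsigned rounds)
{
    uint32_t a = x;
    uint32_t b = x ^ (UINT32_C(1) << bit);
    for (unsigned i = 0; i < rounds; i++) {
        a = f(a);
        b = f(b);
    }
    return hamming32(a, b) / 32.0;
}
```

With zero rounds only the flipped bit differs (1/32). Increasing `rounds` shows how each extra pass spreads that single-bit difference, which is also why the rounds form one long dependence chain.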

SLIDE 5

Breakdown of Cipher Operations

• Rotates
  • Rotate the bits in a register
• Modular Addition
• Modular Multiplication (2^N + 1 prime modulus operations)
• Substitutions
  • Table-based substitutions
  • SBOX - a table of values indexed with plaintext (a byte) that produces the result of the key-parameterized function
• General Permutations
  • XBOX - map N bits onto N bits with any arbitrary exchange of individual bits
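
The operation classes above can be sketched in portable C to show what a general-purpose ISA has to do for each one. This is a minimal illustration, not code from the slides: the function names are introduced here, and `permute32` assumes the caller supplies a valid permutation of the bit positions 0..31.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bit positions (n taken modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

/* Modular addition: unsigned arithmetic in C wraps modulo 2^32. */
static uint32_t add_mod32(uint32_t a, uint32_t b)
{
    return a + b;
}

/* Table-based substitution: one byte of x selects a 32-bit table entry. */
static uint32_t sbox_lookup(const uint32_t table[256], uint32_t x,
                            unsigned shift)
{
    return table[(x >> shift) & 0xff];
}

/* General permutation: output bit i is taken from input bit map[i].
 * map is assumed to be a permutation of 0..31. */
static uint32_t permute32(uint32_t x, const uint8_t map[32])
{
    uint32_t out = 0;
    for (unsigned i = 0; i < 32; i++)
        out |= ((x >> map[i]) & 1u) << i;
    return out;
}
```

Without a rotate instruction `rotl32` costs two shifts and an OR, and `permute32` needs a 32-iteration loop. That cost is what dedicated rotate and XBOX instructions are meant to remove.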

Blowfish Cipher Kernel

    for (ii = 0; ii < BF_ROUNDS; ii++) {
        register BF_LONG tmp;

        r ^= p[ii + 1];                  /* mix in the round subkey */
        /* four S-box lookups, one per byte of l, combined with add/xor */
        r ^= (((sbox[(int)(l >> 24L)] +
                sbox[0x0100 + ((int)(l >> 16L) & 0xff)]) ^
               sbox[0x0200 + ((int)(l >> 8L) & 0xff)]) +
              sbox[0x0300 + ((int)(l) & 0xff)]) & 0xffffffffL;
        tmp = r; r = l; l = tmp;         /* swap the two halves */
    }
    r ^= p[BF_ROUNDS + 1];

SLIDE 6

Cipher Bottleneck Analysis

• Alias - impact of stalling loads in the pipeline until all earlier store addresses have been resolved
• Branch - effects of mispredictions
• Issue - impact of reducing issue width
• Mem - impact of introducing a realistic memory system
• Res - impact of limited functional unit resources
• Window - impact of a limited-size instruction window

[Chart: Analysis of Bottlenecks in Cipher Kernels; axis 0.1 to 1; ciphers shown: 3DES, Mars, RC4, Rijndael, Twofish; series: Alias, Branch, Issue, Mem, Res, Window, All]

Cipher Relative Run Time Cost Focus: Kernel Loop

[Chart: axis 10 to 100 vs. session length in bytes (16, 64, 256, 1k, 4k, 16k, 64k, 256k, 1M); series: Blowfish, 3DES, IDEA, Mars, RC4, RC6, Rijndael, Twofish]

• 3DES and IDEA are small even for 16 byte sessions
• Mars, RC4, RC6, Rijndael, and Twofish drop well below 10% for 4k+ byte sessions
• Blowfish is the outlier; it drops below 10% only for 64k+ byte sessions

SLIDE 7

Cipher Kernel Characterization

• SBOX - substitutions
• XBOX - permutations
• IDEA, Mars, RC4, and RC6 rely on arithmetic computations; benefit from more resources (multiplies) and from faster operations (rotates)
• Blowfish, 3DES, Rijndael and Twofish rely on substitutions; benefit from increased memory bandwidth and accesses

[Chart: Characterization of Cipher Kernel Operations; axis 0% to 100%; categories: Branch, Mov, Ld/St, Xbox, Sbox, Mult, Rotates, Logical, Arith]

Architectural Extensions

• All instructions are limited to two register input operands and one register output
• ROL and ROR (rotates) for 64 and 32-bit data types
• ROLX and RORX support a constant rotate of a register input, followed by an XOR with another register input
• MULMOD computes the modular multiplication of two register values modulo the value 0x10001
• SBOX speeds the accessing of substitution tables with 256-entry tables and 32-bit contents
• SBOXSYNC synchronizes the SBOX table with memory
• XBOX implements a portion of a full 64-bit permutation
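
As a rough software model of what ROLX, RORX and MULMOD compute, the following C sketch may help. It assumes 32-bit operands. The function names and argument order are illustrative, not the actual encoding. `mulmod` is a plain multiply modulo 0x10001 and does not model IDEA's convention of treating an operand of 0 as 2^16.

```c
#include <stdint.h>

/* Rotate helpers; n is taken modulo 32. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

static uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x >> n) | (x << (32 - n)) : x;
}

/* ROLX model: rotate src left by the constant k, then XOR with other. */
static uint32_t rolx(uint32_t src, unsigned k, uint32_t other)
{
    return rotl32(src, k) ^ other;
}

/* RORX model: rotate src right by the constant k, then XOR with other. */
static uint32_t rorx(uint32_t src, unsigned k, uint32_t other)
{
    return rotr32(src, k) ^ other;
}

/* MULMOD model: product modulo 0x10001 (2^16 + 1), using a 64-bit
 * intermediate so the multiplication cannot overflow. */
static uint32_t mulmod(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) % 0x10001u);
}
```

For example, `mulmod(0x10000, 0x10000)` is 1, because 2^16 is congruent to -1 modulo 2^16 + 1.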

SLIDE 8

Crypto-Specific ISA

• Frequent SBOX substitutions
  • X = sbox[(y >> c) & 0xff]
  • X = sbox[ m[ j^c] [1] ]
• SBOX instruction eliminates address generation
  • All SBOX tables are aligned to a 1k byte boundary
  • Address generation becomes zero-latency bit concatenation
• Stores to SBOX storage are not visible to later SBOXes until
  • An SBOXSYNC is executed
  • An alias bit is set
• SBOX instruction
  • Incorporates byte extract
  • Speeds address generation
  • Original 4-cycle operation becomes a 1-cycle CryptoManiac instruction

[Figure: SBOX instruction format and address generation; bit-position ticks: 10, 8, 16, 24, 31; fields: opcode, 00, SBOX Table, Table Index]
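
Address generation reduces to bit concatenation because a 256-entry table of 32-bit words is exactly 1 KB, so a 1 KB aligned base has its low ten bits clear. The C sketch below illustrates this. The function names, the `byte_sel` parameter and the flat 32-bit address model are assumptions made for the example.

```c
#include <stdint.h>

#define SBOX_TABLE_BYTES 1024u   /* 256 entries x 4 bytes per entry */

/* Conventional address generation: shift, mask, scale, then add. */
static uint32_t sbox_addr_add(uint32_t base, uint32_t y, unsigned c)
{
    uint32_t index = (y >> c) & 0xffu;
    return base + index * 4u;
}

/* Concatenation: with a 1 KB aligned base, the byte index shifted left
 * by 2 fills the low ten bits, so an OR replaces the add (no carries). */
static uint32_t sbox_addr_concat(uint32_t base, uint32_t y,
                                 unsigned byte_sel)
{
    uint32_t index = (y >> (8u * byte_sel)) & 0xffu;
    return (base & ~(SBOX_TABLE_BYTES - 1u)) | (index << 2);
}
```

For an aligned base both functions return the same address, but the second needs no adder. That fits the slide's claim that the byte extract and address generation collapse into one single-cycle operation.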

Crypto-Specific ISA (cont.)

• Ciphers often mix logical/arithmetic operations
  • Excellent diffusion properties plus resistance to attacks
• ISA supports instruction combining
  • Logical + ALU op, ALU op + Logical
  • Eliminates dangling XORs
• Reduces kernel loop critical paths by nearly 25%
• Small (< 5%) increase in clock cycle time

SLIDE 9

Performance of ISA Extensions

[Chart: axis 0.5 to 4.5 for Blowfish, 3DES, IDEA, Mars, RC4, RC6, Rijndael, Twofish; series: Orig/4W, Opt/4W, Opt/4W+, Opt/8W+, Opt/DF]

CryptoManiac ISA

• bundle := <inst><inst><inst><inst>
• inst := <operation pair><dest><operand 1><operand 2><operand 3>
• operation pair := <short><tiny> | <tiny><short> | <tiny><tiny> | <long><nop>
• tiny := <xor> | <and> | <inc> | <signext> | <nop>
• short := <add> | <sub> | <rot> | <sbox> | <nop>
• long := <mul> | <mulmod>
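
The operation-pair rule in this grammar can be read as a small legality check. The C sketch below is one interpretation, assuming operations are grouped into the tiny, short and long classes and that `<nop>` counts as both a tiny and a short operation, as the grammar lists it under both. The `op_class` type and `pair_is_legal` function are hypothetical names.

```c
#include <stdbool.h>

/* Operation classes from the grammar; nop is kept separate because it
 * appears in both the tiny and the short productions. */
typedef enum { OP_TINY, OP_SHORT, OP_LONG, OP_NOP } op_class;

static bool is_tiny(op_class c)  { return c == OP_TINY  || c == OP_NOP; }
static bool is_short(op_class c) { return c == OP_SHORT || c == OP_NOP; }

/* Legal pairs: short-tiny, tiny-short, tiny-tiny, and long-nop. */
static bool pair_is_legal(op_class first, op_class second)
{
    if (first == OP_LONG)
        return second == OP_NOP;      /* a long op fills the whole slot */
    if (second == OP_LONG)
        return false;                 /* long may only come first */
    return (is_short(first) && is_tiny(second)) ||
           (is_tiny(first)  && is_short(second)) ||
           (is_tiny(first)  && is_tiny(second));
}
```

Under this reading two short operations can never share an instruction. One plausible explanation is that a short-short pair would not fit in a clock cycle.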

Instruction Semantics

Add-Xor r4, r1, r2, r3    r4 <- (r1 + r2) ⊕ r3
And-Rot r4, r1, r2, r3    r4 <- (r1 & r2) <<< r3
And-Xor r4, r1, r2, r3    r4 <- (r1 & r2) ⊕ r3
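
A software model of these three combined instructions might look like the C sketch below. It assumes 32-bit registers, that the AND is bitwise, that `<<<` denotes a left rotate, and that the circled operator is XOR. The function names are illustrative only.

```c
#include <stdint.h>

/* Rotate left by n bit positions (n taken modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

/* Add-Xor: modular add of r1 and r2, then XOR with r3. */
static uint32_t add_xor(uint32_t r1, uint32_t r2, uint32_t r3)
{
    return (r1 + r2) ^ r3;
}

/* And-Rot: bitwise AND of r1 and r2, then rotate left by r3. */
static uint32_t and_rot(uint32_t r1, uint32_t r2, uint32_t r3)
{
    return rotl32(r1 & r2, r3);
}

/* And-Xor: bitwise AND of r1 and r2, then XOR with r3. */
static uint32_t and_xor(uint32_t r1, uint32_t r2, uint32_t r3)
{
    return (r1 & r2) ^ r3;
}
```

Each function takes three source operands and produces one result, which matches the triadic instruction format `<dest><operand 1><operand 2><operand 3>`.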

SLIDE 10

The CryptoManiac Processor

• A 4-wide 32-bit VLIW machine with no cache and a simple branch predictor
• Supports a triadic (three input operands) ISA that permits combining of most cryptographic operation pairs for better clock cycle utilization
• Can be combined into chip multiprocessor configurations for improved performance on workloads with inter-session and inter-packet parallelism

A Case Study: CryptoManiac

• Efficient crypto-processor for private-key ciphers
  • Chip-multiprocessor design extracts inter-session parallelism
• A highly specialized and efficient design
  • Crypto-specific microarchitecture, ISA, compiler, and circuits

[Figure: chip-multiprocessor organization; encrypt/decrypt requests enter the In Q, a Request Scheduler dispatches them to several CM Proc units that share a Key Store, and ciphertext/plaintext results leave through the Out Q. Request format: id, session, action, data… Result format: id, session, result…]

SLIDE 11

Crypto-Specific Microarchitecture

• Simple 4-wide 32-bit statically scheduled VLIW
  • No caches needed, small instruction and data RAMs
  • 16-entry BTB predicts branches
• Resulting design is small and efficient

[Figure: pipeline diagram with stages IF, ID/RF, EX/MEM, WB; blocks: BTB, IMEM, RF, four FUs, Data Mem, InQ/OutQ Interface, Keystore Interface]

Crypto-Specific Functional Unit

[Figure: functional unit containing a pipelined 32-bit MUL, a 1K byte SBOX cache, a 32-bit adder, a 32-bit rotator, and two XOR/AND logical units; operation slots labeled {tiny}, {short}, {tiny}, {long}]

SLIDE 12

Timing and Area Results

Timing and Area Estimates for Various CryptoManiac Configurations

                      4W Comb           3W Comb           2W Comb           4W NoComb
Timing Estimate       2.78 ns           2.66 ns           2.54 ns           2.76 ns
Area Estimate         1.39mm x 1.39mm   1.33mm x 1.33mm   1.26mm x 1.26mm   1.3mm x 1.3mm
Power Estimate        606.37 mW         593.51 mW         568.50 mW         586.86 mW
Synthesis Constraint  3 ns              3 ns              3 ns              3 ns
Critical Path         byps-lgc-add-lgc  byps-lgc-add-lgc  byps-lgc-add-lgc  add
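
If the timing estimate is read as the minimum clock period (an assumption, since the table gives no clock rate), the implied maximum clock frequencies and the cycle-time cost of instruction combining work out roughly as follows:

```latex
\[
f_{\text{4W Comb}} = \frac{1}{2.78\ \text{ns}} \approx 360\ \text{MHz},
\qquad
f_{\text{4W NoComb}} = \frac{1}{2.76\ \text{ns}} \approx 362\ \text{MHz}
\]
\[
\frac{2.78 - 2.76}{2.76} \approx 0.7\%
\]
```

Under this reading, combining lengthens the 4-wide cycle by under 1%, which is consistent with the small (< 5%) clock cycle increase quoted for instruction combining.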

Crypto-Specific Compiler

• Crypto-kernels
  • Small code size
  • Deterministic dependencies
  • Deterministic control
  • Deterministic latency
  • Little loop-level parallelism
• Super-optimizer identifies optimal schedule
  • Generates all schedules
  • Chooses best given constraints and goals
• Focuses uArch evaluation

[Figure: super-optimizer flow; labels: Super-optimizer, Eval, …, Eval, Max, Scheduler, Optimal Schedule, GCC, Hand Code]

SLIDE 13

Blowfish Cipher Kernel

    for (ii = 0; ii < BF_ROUNDS; ii++) {
        register BF_LONG tmp;

        r ^= p[ii + 1];                  /* mix in the round subkey */
        /* four S-box lookups, one per byte of l, combined with add/xor */
        r ^= (((sbox[(int)(l >> 24L)] +
                sbox[0x0100 + ((int)(l >> 16L) & 0xff)]) ^
               sbox[0x0200 + ((int)(l >> 8L) & 0xff)]) +
              sbox[0x0300 + ((int)(l) & 0xff)]) & 0xffffffffL;
        tmp = r; r = l; l = tmp;         /* swap the two halves */
    }
    r ^= p[BF_ROUNDS + 1];

Scheduling Example: Blowfish

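[Figure: Blowfish round shown twice. Operations in the first diagram: SBOX, SBOX, SBOX, SBOX, ADD, XOR, ADD, XOR, Sign Ext, Load, XOR. Operations in the second diagram, with combined instructions: SBOX, SBOX, SBOX, SBOX, SBOX, Add-XOR, Load, Add, XOR, XOR-SignExt.]

Takes a total of only 4 cycles to execute!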

SLIDE 14

Simulation Methodology

• Use SimpleScalar as baseline processor
• Compiled original algorithms on the Alpha
• Broke simulation for the algorithms into 2 sections
  1. Startup and shutdown code
  2. Cipher kernel
• Converted the Alpha assembly code of the cipher kernels into their own ISA
• CryptoManiac results could be captured by
  • SimpleScalar results running the algorithm WITHOUT the cipher kernel
  • + Simulating the cipher kernel in a special interpreter
• Or could be captured by
  • Modifying SimpleScalar to switch to the cipher interpreter when a special instruction is fetched, and switch back when finished

Encryption Performance

[Chart: Encryption Rates; axis 0.0 to 90.0 for Blowfish, 3DES, IDEA, MARS, RC4, RC6, Rijndael, Twofish; series: Alpha, ISA+, ISA++, 4WC, 3WC, 2WC, 4WNC; reference lines: T-3, HDTV, OC-3, OC-12]

SLIDE 15

Special Case Studies: 3DES and Rijndael

[Chart: Performance/Area Tradeoff for 3DES and Rijndael; x-axis: Area (um2), 0 to 3,000,000; y-axis: 0.0 to 90.0; points labeled by configuration: 2W, 3W, 4W, 4WN, 8W]

Conclusion

• Two hardware/software-design techniques to improve the performance of secret-key cipher algorithms
  • Add instruction support for fast substitutions, general permutations, rotates, and modular arithmetic
    • SBOX eliminates address generation
    • Overall speedup of 59% over baseline machine w/ rotates
  • Design an efficient 4-wide VLIW cryptographic co-processor called the CryptoManiac
    • Instruction combining - efficient utilization of clock cycle
    • Rijndael runs 2.25 times faster with 1/100th area and power of a 600MHz Alpha processor