
SLIDE 1

Arithmetic Tradeoffs on Performance/Cost/Security for Hardware Asymmetric Cryptography

Arnaud Tisserand

CNRS, Lab-STICC laboratory

CEA Seminar, July 2017

Hardware Security Group at Lab-STICC

8 faculty members and ≈ 12 PhD students / postdocs / ATER / engineers

Hardware security for embedded systems:

◮ memory and communication protection
◮ secure OS with HW blocks, DIFT
◮ multicore / manycore security

Crypto implementations in hardware & embedded software:

◮ asymmetric (RSA, (H)ECC, PQC)
◮ arithmetic aspects (operators, libraries)
◮ homomorphic encryption

Secure hardware implementation:

◮ side channel and fault injection attacks and protections
◮ targets: FPGA and ASIC (reconfigurable, CGRA, ASIP)
◮ high-level synthesis (HLS) for security

Lab-STICC: Laboratoire des Sciences et Techniques de l’Information, de la Communication et de la Connaissance
MOCS: Méthodes, Outils, Circuits et Systèmes

Arnaud Tisserand. CNRS – Lab-STICC. Arithmetic Tradeoffs for Hardware Asymmetric Cryptography 2/53

Skills (1/2)

Optimized hardware arithmetic operators:

◮ low-power operators ($x \pm y$, $x \cdot y$, $1/x$, $\sqrt{x}$, $1/\sqrt{x^2 + y^2}$, $\sum_{i=0}^{n} x_i y_i$, ...)
◮ multiplication by constants (scalar, vector, matrix)
◮ advanced computation algorithms
◮ advanced representations of numbers
◮ function approximation (sin(x), cos(x), exp(x), log(x), tan(x), ...)
◮ modular and finite field arithmetic over $\mathbb{F}_p$ and $\mathbb{F}_{2^m}$
◮ fault tolerant (or fault detecting) operators
◮ FPGA and ASIC targets

Tools for hardware arithmetic circuits:

◮ operator generators
◮ arithmetic circuits with bounded errors

Software arithmetic/computation libraries:

◮ (public-key) cryptography
◮ floating-point emulation on integer processors
◮ multiprecision computations (up to millions of bits)
◮ embedded processors, multi-cores and GPUs targets

Skills (2/2)

Hardware accelerators for crypto. applications:

◮ public-key crypto.: RSA, (H)ECC
◮ private-key: AES, (3)DES
◮ hash functions: SHAx (multi-mode)

Crypto-processor for (Hyper)-Elliptic Curve Cryptography:

◮ arithmetic operators over $\mathbb{F}_p$ and $\mathbb{F}_{2^m}$, typically 100–600 bits
◮ optimized architectures, algorithms and number representations

Software libraries for arithmetic and cryptography:

◮ ECC library for GPUs and embedded processors
◮ RNS library for homomorphic encryption in multicores

Study and implementation of protections against physical attacks:

◮ Passive: power consumption, electromagnetic radiations, timings
◮ Active: fault injection (in progress)

Levels: arithmetic algorithms, numbers/objects representations, operators, architectures, circuit optimizations
Trade-offs between: performance, cost (area/energy), security
True random number generators (TRNGs)


SLIDE 2

(Hyper-)Elliptic Curve Cryptography (H)ECC

Finite field Fp:

integer arithmetic modulo large prime p

Elliptic curve over Fp:

$E : y^2 = x^3 + ax + b$

Points on the curve E:

P = (x1, y1), Q = (x2, y2), R = (x3, y3)

Set of points on E:

◮ finite (large, # about p)
◮ “forms” an abelian group
◮ group law: addition on points

Two operations:

◮ Point addition: P + Q → R (denoted ADD)
◮ Point doubling: P + P = [2]P → R (denoted DBL)

Fp elements are very large: 100–600 bits!

[Figure: the curve $y^2 = x^3 + 4x + 20$ over $\mathbb{F}_{1009}$]
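The two curve-level operations can be tried on the toy curve of the figure. The following is a minimal sketch in affine coordinates, assuming the textbook chord-and-tangent formulas; the names `add`, `on_curve` and the sample point (1, 5) are illustrative choices, not taken from the slides, and real accelerators use projective coordinates to avoid the field inversion.

```python
# Toy parameters from the figure: y^2 = x^3 + 4x + 20 over F_1009.
P_MOD, A, B = 1009, 4, 20
INF = None  # point at infinity (neutral element of the group)

def on_curve(pt):
    """True if pt satisfies the curve equation (or is the point at infinity)."""
    if pt is INF:
        return True
    x, y = pt
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0

def add(p, q):
    """Group law in affine coordinates; handles ADD, DBL and the special cases."""
    if p is INF:
        return q
    if q is INF:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF                      # P + (-P) = O
    if p == q:                          # DBL: slope of the tangent
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD)
    else:                               # ADD: slope of the chord
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

P = (1, 5)        # 5^2 = 25 = 1 + 4 + 20, so P lies on the curve
D = add(P, P)     # DBL: [2]P
T = add(D, P)     # ADD: [3]P
print(D, T, on_curve(D), on_curve(T))
```

Each call costs one inversion plus a few multiplications in Fp, which is why the cost tables later in the deck count M, S and I separately.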


Curve Level and Field Level Operations

Operation        Notation   Result   Definition
point addition   ADD        P + Q    if P ≠ ±Q
doubling         DBL        [2]P     P + P
tripling         TPL        [3]P     P + P + P
quintupling      QPL        [5]P     P + ··· + P
septupling       SPL        [7]P     P + ··· + P
...

Operation at curve level ⇒ sequence of ≈ 10–20 Fq operations
Fq operations: add/sub, multiplication M, square S, inversion I

[Figure: one scalar multiplication [k]P → hundreds of curve operations (ADD, DBL, ...) → thousands of field operations (M, S, I in Fp) → clock cycles; for example one DBL expands into a sequence M ... M S ...]


Costs of Curve Level Operations

Best computation costs from literature and curves over Fp

a      refs.      ADD        mADD      DBL       TPL        QPL         SPL
≠ −3   EFD        11M + 5S   7M + 4S   1M + 8S   5M + 10S   n. a.       n. a.
≠ −3   [18]       n. a.      n. a.     1M + 8S   5M + 10S   7M + 16S    15M + 24S
≠ −3   [22]       11M + 5S   7M + 4S   2M + 8S   6M + 11S   9M + 15S    13M + 18S
= −3   EFD        11M + 5S   7M + 4S   3M + 5S   7M + 7S    n. a.       n. a.
= −3   [24]       11M + 5S   7M + 4S   3M + 5S   7M + 7S    11M + 11S   18M + 11S
= −3   [23][22]   11M + 5S   7M + 4S   3M + 5S   7M + 8S    10M + 12S   14M + 15S

refs.          λDBL               λTPL
[14][15][20]   4λM + (4λ + 2)S    (11λ − 1)M + (4λ + 2)S

refs.          λTPL / λ′DBL
[14][15]       (11λ + 4λ′ − 1)M + (4λ + 4λ′ + 3)S

EFD: Explicit-Formulas Database http://hyperelliptic.org/EFD
mADD: A + J → J (mixed addition: affine + Jacobian → Jacobian)


ECC Scalar Multiplication

$Q = [k]P = \underbrace{P + P + \cdots + P}_{k \text{ times}}$

main operation in ECC protocols
P ∈ E
$k = (k_{n-1} k_{n-2} \ldots k_1 k_0)_2$
n = 160–600 bits

Double-and-add scalar multiplication algorithm:

Q ← O
for i from n − 1 to 0 do
    Q ← [2]Q (DBL)
    if k_i = 1 then Q ← Q + P (ADD)
return Q

Scans each bit of k and performs corresponding curve-level operation
Average cost: 0.5n ADD + n DBL (security ⇒ ≈ 0.5n ones in k)

Security : Elliptic curve discrete logarithm problem (ECDLP)

given P and Q = [k]P, it is computationally unfeasible to obtain k
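To make the operation count of double-and-add concrete, here is a small sketch that runs the left-to-right loop on a stand-in group. It assumes that a point [m]P can be represented by the integer m (so ADD is integer addition and DBL is multiplication by 2); this stand-in, the function name `double_and_add` and the random seed are illustrative assumptions, not part of the slides.

```python
import random

def double_and_add(k, n, P, add, dbl, zero):
    """Left-to-right double-and-add; returns ([k]P, number of ADD, number of DBL)."""
    Q, n_add, n_dbl = zero, 0, 0
    for i in range(n - 1, -1, -1):      # scan k from bit n-1 down to bit 0
        Q = dbl(Q)
        n_dbl += 1
        if (k >> i) & 1:                # key-dependent ADD
            Q = add(Q, P)
            n_add += 1
    return Q, n_add, n_dbl

# Stand-in group: the integer m plays the role of the point [m]P.
int_add = lambda a, b: a + b
int_dbl = lambda a: 2 * a

n = 160
random.seed(1)
k = random.getrandbits(n)
Q, n_add, n_dbl = double_and_add(k, n, 1, int_add, int_dbl, 0)
print(Q == k, n_add, n_dbl)
```

The number of ADDs equals the Hamming weight of k, close to n/2 for a random scalar, while the number of DBLs is always n.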


SLIDE 3

Basic Power Analysis Attack on ECC

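[Figure: abstraction levels of an ECC implementation: protocol level (encryption, signature, etc.), curve level ([k]P, ADD(P, Q), DBL(P)), field level (x ± y, x × y, ...); the current I of the circuit (between VDD and GND) is recorded as traces in which the sequence of DBL and ADD operations is visible, giving the key bits (0 0 0 1 1 in the example)]

Scalar multiplication operation:

for i from 0 to t − 1 do
    if k_i = 1 then Q = ADD(P, Q)
    P = DBL(P)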
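The leak exploited by this basic attack can be sketched as follows. The example assumes an idealized setting where the attacker already has the sequence of operation names (a real attack first has to classify ADD and DBL patterns in measured power traces); the function names `trace_of` and `recover_key` are hypothetical.

```python
def trace_of(k, t):
    """Operation sequence produced by the right-to-left loop of the slide."""
    ops = []
    for i in range(t):
        if (k >> i) & 1:
            ops.append("ADD")
        ops.append("DBL")
    return ops

def recover_key(ops):
    """Rebuild k from the ADD/DBL pattern: every DBL closes one loop iteration."""
    k, i, saw_add = 0, 0, False
    for op in ops:
        if op == "ADD":
            saw_add = True
        else:                     # DBL: end of iteration i
            if saw_add:
                k |= 1 << i       # an ADD in this iteration means k_i = 1
            i += 1
            saw_add = False
    return k

ops = trace_of(0b11000, 5)
print(ops, bin(recover_key(ops)))
```

Because the ADD is executed only when the key bit is 1, the operation sequence is a direct encoding of k, which motivates the regular algorithms discussed later in the deck.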


Accelerator Specifications

[Figure: HW/SW partitioning (labels HW, SW, HW) over the abstraction levels: protocol level (encryption, signature, etc.), curve level ([k]P, ADD(P, Q), DBL(P), P + P), field level (x ± y, x × y, ...)]

Performances ⇒ hardware (HW)

◮ dedicated functional units
◮ internal parallelism

Limited cost (embedded systems)

◮ reduced silicon area
◮ low energy (& power consumption)
◮ large area used at each clock cycle

Flexibility ⇒ software (SW)

◮ curves, algorithms, representations (points/elements), k recoding, ...
◮ at design time / at run time

Security against SCAs ⇒ HW

◮ secure units ($\mathbb{F}_{2^m}$, $\mathbb{F}_p$)
◮ secure key storage/management
◮ secure control

Accelerator Architecture

[Figure: accelerator architecture: external interface, interconnect, control (CTRL), code memory, key management, register file, functional units FU1, FU2, FU3]

Data: w-bit (32, . . . , 128) except for k digits, control: a few bits per unit


Protected F2m Multipliers

[Figure: activity traces (#transitions versus cycles) of a Mastrovito 233 multiplier, unprotected and protected]

Overhead of the protected version: area/time < 10 %
References: PhD D. Pamula [29]
Articles: [32], [31], [30]


SLIDE 4

Protected (Old) Accelerator for F2m

[Figure: activity traces (#transitions versus cycles) and current measures (current in mA) for the DBL operation with Mastrovito multipliers, unprotected and protected, and activity trace for the ADD operation, protected]

Warning: old dedicated accelerator (similar behavior is expected for our new one)


Circuit-Level Protections for Arithmetic Operators

References: [12] and [13]


Comparison Architecture ECC 256 vs HECC 128 (1/2)

Implementations on Spartan 6 FPGAs without DSP slices

[Figure: area [slices] (600 to 2200) versus time [ms] (5 to 30) for the ECC and HECC architectures; each point is labeled by its configuration pair (labels from 1,1 to 5,4 and from 1,1 to 12,2)]

On average HECC is 40 % faster than ECC for a similar silicon cost


Comparison Architecture ECC 256 vs HECC 128 (2/2)

[Figure: % usage (20 to 100) and area speedup for selected ECC and HECC configurations (labels 1,1 1,2 1,4 2,4 3,4 4,4 and 1,1 1,2 2,1 3,1 3,2 5,2 8,2)]


SLIDE 5

ECC 256 vs Kummer-HECC 128 (similar theor. security)

Recent results presented at CryptArchi 2017 [17].

FPGA   Version   DSP   BRAM 18K   Slices   Freq. (MHz)   Nb. cycles   Time (ms)
V4     ECC       37    11         4655     250           109,297      0.44
V4     HECC 1u   11    7          1413     330           183,051      0.55
V4     HECC 2u   22    9          2356     330           115,211      0.35
V5     ECC       37    10         1725     291           109,297      0.38
V5     HECC 1u   11    7          873      360           183,051      0.51
V5     HECC 2u   22    9          1542     360           115,211      0.32

Gain 1u on V5: -70% DSPs, -30% BRAMs, -49% slices, +30% duration
Gain 2u on V5: -40% DSPs, -10% BRAMs, -10% slices, -15% duration
ECC results from [25]


Double-Base Number System

Standard radix-2 representation:

$k = \sum_{i=0}^{t-1} k_i 2^i = (k_{t-1}\, k_{t-2}\, \ldots\, k_2\, k_1\, k_0)$ with implicit weights $2^{t-1}, 2^{t-2}, \ldots, 2^2, 2^1, 2^0$

t explicit digits, implicit weights
Digits: $k_i \in \{0, 1\}$, typical size: $t \in \{160, \ldots, 600\}$

Double-Base Number System (DBNS):

$k = \sum_{j=0}^{n-1} k_j 2^{a_j} 3^{b_j} = ((k_{n-1}, a_{n-1}, b_{n-1}), \ldots, (k_1, a_1, b_1), (k_0, a_0, b_0))$

n (2, 3)-terms, explicit “digits”, explicit ranks
$a_j, b_j \in \mathbb{N}$, $k_j \in \{1\}$ or $k_j \in \{-1, 1\}$, size n ≈ log t

DBNS is a very redundant and sparse representation:

$1701 = (11010100101)_2$
$1701 = 243 + 1458 = 2^0 3^5 + 2^1 3^6 = (1, 0, 5), (1, 1, 6)$
$\phantom{1701} = 1728 - 27 = 2^6 3^3 - 2^0 3^3 = (1, 6, 3), (-1, 0, 3)$
$\phantom{1701} = 729 + 972 = 2^0 3^6 + 2^2 3^5 = (1, 0, 6), (1, 2, 5)$
...
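The redundancy of DBNS can be checked numerically. The sketch below evaluates the three (k_j, a_j, b_j) decompositions of 1701 given on the slide, and adds a simple greedy recoding (repeatedly subtract the largest 2^a 3^b not exceeding k). The greedy rule is a common heuristic used here only for illustration; it is an assumption and not necessarily the recoding used in the cited works.

```python
def dbns_value(terms):
    """Value of a list of DBNS terms (k_j, a_j, b_j): sum of k_j * 2^a_j * 3^b_j."""
    return sum(d * 2**a * 3**b for d, a, b in terms)

def greedy_dbns(k):
    """Unsigned greedy recoding: repeatedly subtract the largest 2^a * 3^b <= k."""
    terms = []
    while k > 0:
        best = (0, 0, 0)                          # (value, a, b)
        b, pow3 = 0, 1
        while pow3 <= k:
            a = (k // pow3).bit_length() - 1      # largest a with 2^a * 3^b <= k
            best = max(best, (pow3 << a, a, b))
            b, pow3 = b + 1, pow3 * 3
        terms.append((1, best[1], best[2]))
        k -= best[0]
    return terms

slide_examples = [
    [(1, 0, 5), (1, 1, 6)],
    [(1, 6, 3), (-1, 0, 3)],
    [(1, 0, 6), (1, 2, 5)],
]
print([dbns_value(t) for t in slide_examples])
print(greedy_dbns(1701))
```

Comparing the number of terms with the six nonzero bits of the binary expansion illustrates the sparseness claimed above.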


Faster Scalar Multiplication Algorithms

Representation of k impacts #operations ⇒ recode k:

◮ non-adjacent forms (NAF/NAF-w):
  high-radix signed-digits representations increase #0s

◮ double-base number systems (DBNS): $x = \sum_{i=1}^{n'} d_i\, b_1^{u_i} b_2^{v_i}$ with $d_i = \pm 1$
  b1 and b2 co-prime integers (typically (b1, b2) = (2, 3))
  specific op.: point tripling [3]P = P + P + P, denoted TPL
  decreasing exponents (Horner form) ⇒ higher speed

◮ multi-base number systems (MBNS):
  more than two bases (co-prime integers), e.g. (2, 3, 5) and (2, 3, 5, 7)
  $x = \sum_{i=1}^{n'} d_i \prod_{j=1}^{l} b_j^{e_{j,i}}$ with $d_i = \pm 1$

BUT those recoding methods require pre-computations:

◮ NAF-w: pre-compute and store $P_j = [j]P$, $\forall j \in \{3, 5, \ldots, 2^{w-1} - 1\}$
◮ DBNS/MBNS recoding is performed off-line

Remark: point subtraction (SUB) is as efficient as point addition
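As an illustration of how a signed-digit recoding increases the number of zeros, here is a minimal sketch of the standard (width-2) NAF recoding with digits in {−1, 0, 1}. The function name `naf` and the least-significant-first digit order are illustrative choices.

```python
def naf(k):
    """Digits of the non-adjacent form of k, least significant first."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)     # +1 if k = 1 mod 4, -1 if k = 3 mod 4
            k -= d              # k becomes divisible by 4: next digit is 0
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

k = 0b11101111
digits = naf(k)
print(digits, sum(d != 0 for d in digits), bin(k).count("1"))
```

Each nonzero digit costs one ADD or SUB, which is why the remark on cheap point subtraction matters for signed recodings.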


MBNS Recoding Work presented at ARITH 21

[11] T. Chabrier and A. Tisserand. On-the-fly multi-base recoding for ECC scalar multiplication without pre-computations. In A. Nannarelli, P.-M. Seidel, and P. T. P. Tang, editors, Proc. 21st Symposium on Computer Arithmetic (ARITH), pages 219–228, Austin, TX, U.S.A, April 2013. IEEE Computer Society. DOI: 10.1109/ARITH.2013.17 http://tel.archives-ouvertes.fr/hal-00772613 (PDF)


SLIDE 6

Notations

◮ $k = (k_{n-1} k_{n-2} \ldots k_1 k_0)_2$, k > 1, the n-bit scalar stored into t words of w bits with w(t − 1) < n ≤ wt (i.e. last word may be 0-padded)

◮ $k^{(i)}$ the i-th word of k starting from least significant, for 0 ≤ i < t

◮ B the multi-base with l base elements (co-prime integers), $B = (b_1, b_2, \ldots, b_l)$

◮ predicate divisible(x, B) returns true if x is divisible by at least one base element in B (false for other cases)

◮ number x represented as the sum of terms $x = \sum_{i=1}^{n'} d_i \prod_{j=1}^{l} b_j^{e_{j,i}}$ with $d_i \in \{0, \pm 1\}$ and in Horner form

◮ term $(d_i, e_{1,i}, e_{2,i}, \ldots, e_{l,i})$ defined by $d_i \times \prod_{j=1}^{l} b_j^{e_{j,i}}$ in B (index i may be omitted when context is clear)

◮ Q, P curve points and Q = [k]P scalar multiplication


Very Simple MBNS Unsigned Recoding Algorithm

Transforms k into a list of terms (LT) in Horner form

 1: LT ← ∅
 2: while k > 1 do
 3:     if not divisible(k, B) then        (divisibility test)
 4:         d ← 1
 5:         k ← k − 1
 6:     else
 7:         d ← 0
 8:     for j from 1 to l do
 9:         e_j ← 0
10:         while k ≡ 0 mod b_j do         (divisibility test)
11:             e_j ← e_j + 1
12:             k ← k/b_j                  (exact division)
13:     LT ← LT ∪ (d, e_1, e_2, . . . , e_l)
14: return LT

Remark: divisibility tests at lines 3 and 10 are shared
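The recoding loop above translates almost line by line into software. This is a sketch for checking the algorithm on small values, assuming plain Python integers instead of the word-serial hardware datapath; `mbns_recode` and `mbns_value` are hypothetical names, and terms are returned as pairs (d, (e_1, ..., e_l)).

```python
def divisible(k, bases):
    """True if k is divisible by at least one base element."""
    return any(k % b == 0 for b in bases)

def mbns_recode(k, bases=(2, 3, 5, 7)):
    """Unsigned MBNS recoding: list of terms (d, (e_1, ..., e_l)) in Horner order."""
    terms = []
    while k > 1:
        if not divisible(k, bases):
            d, k = 1, k - 1
        else:
            d = 0
        exps = []
        for b in bases:
            e = 0
            while k % b == 0:          # divisibility test
                e, k = e + 1, k // b   # exact division
            exps.append(e)
        terms.append((d, tuple(exps)))
    return terms

def mbns_value(terms, bases=(2, 3, 5, 7)):
    """Evaluate the Horner form, starting from the innermost value 1."""
    value = 1
    for d, exps in reversed(terms):
        prod = 1
        for b, e in zip(bases, exps):
            prod *= b ** e
        value = d + prod * value
    return value

print(mbns_recode(87))   # two terms: 87 = 0 + 3 * (1 + 2^2 * 7)
```

Evaluating the list from the last term back to the first gives the original scalar, which is a convenient self-check for any recoding variant.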


Very Simple MBNS Scalar Multiplication Algorithm

◮ MBNS recoding works in a serial way starting with the most significant terms
◮ each term can be immediately used in the scalar multiplication: recoded terms are processed and used on-the-fly
◮ multi-base adaptation of standard left-to-right scalar multiplication ([19, Sec. 3.3.1])

Q ← O
foreach t in LT do                       (t = (d, e_1, e_2, . . . , e_l))
    Q ← Q + d × P                        (d ∈ {0, 1} ⇒ NOP/ADD)
    for j from 1 to l do
        P ← [b_j^{e_j}]P                 (DBL, TPL, QPL, . . . )
Q ← Q + P
return Q

Remark 1: recoding and curve-level operations are overlapped
Remark 2: P is modified over time, we cannot use mADD (time penalty)
Remark 3: d = 0 is only possible for the very first term
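The scalar multiplication loop above can be exercised with a stand-in group. The sketch below assumes that the point [m]P is modeled by the integer m, so that ADD is integer addition and DBL/TPL/QPL/SPL are multiplications by 2, 3, 5, 7; the function name, the callback parameters and the hard-coded term lists are illustrative assumptions.

```python
def mbns_scalar_mult(terms, P, bases, add, mul_small, zero):
    """Consume MBNS terms (d, (e_1, ..., e_l)) in order and return [k]P."""
    Q = zero
    for d, exps in terms:
        if d == 1:
            Q = add(Q, P)                  # ADD (d = 0 is a NOP)
        for b, e in zip(bases, exps):
            for _ in range(e):
                P = mul_small(b, P)        # DBL, TPL, QPL, SPL for b = 2, 3, 5, 7
    return add(Q, P)                       # final addition of the running point

BASES = (2, 3, 5, 7)
int_add = lambda a, b: a + b               # stand-in for point addition
int_mul = lambda b, m: b * m               # stand-in for [b]P

terms_87 = [(0, (0, 1, 0, 0)), (1, (2, 0, 0, 1))]   # 87 = 0 + 3 * (1 + 2^2 * 7)
print(mbns_scalar_mult(terms_87, 1, BASES, int_add, int_mul, 0))
```

Since P itself is updated by the DBL/TPL/QPL/SPL operations, it does not stay in affine form, which is the reason given in Remark 2 for not using mADD.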


Implementation of Divisibility Tests (1/2)

We use Pascal’s tapes, [33] (published in 1819), [35]; values are $2^i \bmod b_j$:

b_j \ i   11  10   9   8   7   6   5   4   3   2   1   0
3          2   1   2   1   2   1   2   1   2   1   2   1
5          3   4   2   1   3   4   2   1   3   4   2   1
7          4   2   1   4   2   1   4   2   1   4   2   1

For b_j = 3, the periodic sequence is (2 1)*:

$k \bmod 3 = (\ldots + 2^3 k_3 + 2^2 k_2 + 2^1 k_1 + k_0) \bmod 3$
$\phantom{k \bmod 3} = (\ldots + 2 k_3 + k_2 + 2 k_1 + k_0) \bmod 3$
$\phantom{k \bmod 3} = \underbrace{\textstyle\sum_i (2 k_{2i+1} + k_{2i})}_{\alpha} \bmod 3$

For b_j = 5, the periodic sequence is (3 4 2 1)*:

◮ use 3 = 1 + 2 ⇒ unsigned sum with additional inputs (FPGA)
◮ use 3 ≡ −2 mod 5 ⇒ signed sum with fewer operands
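The periodic-weight idea can be checked in a few lines. The sketch below recomputes k mod b_j from the bits of k and the periodic sequence of 2^i mod b_j; it is a software illustration under the assumption of a bit-serial accumulation, whereas the hardware unit processes w-bit words with small adder trees. The function name `residue_by_weights` is hypothetical.

```python
import random

def residue_by_weights(k, b, n_bits):
    """k mod b computed from the bits of k and the periodic weights 2^i mod b."""
    period = [1]                              # weights for i = 0, 1, 2, ...
    while (2 * period[-1]) % b != 1:
        period.append((2 * period[-1]) % b)   # (1 2) for 3, (1 2 4 3) for 5, (1 2 4) for 7
    acc = 0
    for i in range(n_bits):
        if (k >> i) & 1:
            acc += period[i % len(period)]    # small sum instead of a full division
    return acc % b

random.seed(2)
k = random.getrandbits(160)
print([residue_by_weights(k, b, 160) == k % b for b in (3, 5, 7)])
```

The accumulated sum stays small (at most n_bits × (b − 1)), so only a short final reduction is needed to decide divisibility.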


SLIDE 7

Implementation of Divisibility Tests (2/2)

To avoid complex decoding, we use w = lcm(2, 4, 3) = 12 and w = 24 . . .

[Figure: divisibility test unit: the t words of w bits of k are read from the k memory as k(i) under the control of CTRL (index i); one block per b_j ∈ {3, 5, 7}, each with a register R and a 1-bit output "divisible by b_j"]

FPGA results for n = 160 (XC5VLX50T, ISE 12.4, std efforts S/P&R):

w    area: slices (FF/LUT)   freq. (MHz)   clock cycles
12   25 (40/81)              543           t + 3
24   67 (53/152)             549           t + 4


Implementation of Exact Division by bj Elements (1/2)

Exact division k/b_j: we know that k is divisible by b_j
Algorithm from [21] (LSWF), optimized for FPGA and b_j ∈ {3, 5, 7}:

 1: c ← 0
 2: for i from 0 to t − 1 do
 3:     r ← k(i) − c
 4:     r ← r × (b_j^{−1} mod 2^w)
 5:     c ← 0
 6:     for h from 1 to b_j − 1 do
 7:         if r ≥ h × ⌈(2^w − 1)/b_j⌉ then
 8:             c ← c + 1
 9:     k(i) ← (r_{w−1} · · · r_0)
10: return k

b_j   b_j^{−1} mod 2^12, γ    b_j^{−1} mod 2^24, γ
3     (101010101011)_2, 3     (101010101010101010101011)_2, 4
5     (110011001101)_2, 3     (110011001100110011001101)_2, 4
7     (110110110111)_2, 3     (110110110110110110110111)_2, 4

We use our multiplication by constant algorithm [8]
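A least-significant-word-first exact division can be sketched in software as follows. This is a generic variant that tracks the borrow explicitly and derives the carry from the product q × b_j; it is an assumption made for clarity and not necessarily the comparison-based carry computation of the slide. The helper names `to_words`, `from_words` and `exact_div` are hypothetical.

```python
import random

def to_words(k, w, t):
    """Split k into t words of w bits, least significant word first."""
    return [(k >> (w * i)) & ((1 << w) - 1) for i in range(t)]

def from_words(words, w):
    return sum(x << (w * i) for i, x in enumerate(words))

def exact_div(words, b, w):
    """Word-serial exact division by an odd constant b (k must be a multiple of b)."""
    mask = (1 << w) - 1
    inv = pow(b, -1, 1 << w)            # b^-1 mod 2^w, a precomputed constant in hardware
    c, out = 0, []
    for x in words:
        r = x - c                       # subtract the carry of the previous word
        borrow = 1 if r < 0 else 0
        q = ((r & mask) * inv) & mask   # quotient word: multiplication by a constant
        c = ((q * b) >> w) + borrow     # carry propagated to the next word
        out.append(q)
    return out

random.seed(4)
quotient = random.getrandbits(150)
words = to_words(7 * quotient, 12, 14)
print(from_words(exact_div(words, 7, 12), 12) == quotient)
```

Only multiplications by the constant b_j^{-1} mod 2^w are needed, which is where a dedicated multiplication-by-constant operator pays off.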


Implementation of Exact Division by bj Elements (2/2)

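[Figure: exact division unit: the t words of w bits of k are read (R) from and written (W) to the k memory as k(i) under the control of CTRL (index i); for each b_j ∈ {3, 5, 7}, a w-bit multiplier by the constant b_j^{−1} mod 2^w followed by a ± stage and a comparison sequence (seq. cmp.); MUX1 and MUX2 select the result and the carry c according to the 2-bit signal sel. b_j]

FPGA results for n = 160 (XC5VLX50T, ISE 12.4, std efforts S/P&R):

w    area: slices (FF/LUT)   freq. (MHz)   clock cycles
12   59 (138/171)            291           t + 4
24   152 (441/448)           202           t + 5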


Unsigned Multiple-Base Recoding Unit

[Figure: unsigned multiple-base recoding unit: k memory (t words of w bits, read R / write W of k(i), index i from CTRL), decrement (−1), DTD-2 block, divisibility test for 3, 5, 7, exact division by 3, 5, 7, MUX, global control]

FPGA results for n = 160 and B = (2, 3, 5, 7) (XC5VLX50T, ISE 12.4, std efforts S/P&R):

w    area: slices (FF/LUT)   freq. (MHz)
12   153 (301/412)           232
24   323 (682/908)           202

Remark: DTD-2: divisibility test and division by $2^{1 \ldots v}$ with v ≤ w


SLIDE 8

Example

$87 = 0 + 3^1 \times (1 + 2^2\, 7^1)$

Recoding of k (time →):
87 → DT (res.: 3) → /3 → 29 → DT (res.: ∅) → −1 → 28 → DT (res.: 2^2, 7) → /2^2 → 7 → /7 → 1 → DT (res.: ∅)

Curve-level operations (time →):
CLO:   (start)   TPL   ADD   DBL   DBL   SPL   ADD
P:     P         3P    3P    6P    12P   84P   84P
Q:     O         O     3P    3P    3P    3P    87P

Notations:

◮ “CLO” denotes curve-level operations
◮ “DT” denotes divisibility tests, “res.” their results
◮ “/b_j” denotes exact division by b_j

Remark: very short latency at the very beginning (< 0.01 % of [k]P for n = 160 and even less for larger fields)


Signed Digits Version: d ∈ {0, ±1}

Add a selection function S in the recoding algorithm:

unsigned version            signed version
4: d ← 1              →     4: d ← S(k)
5: k ← k − 1                5: k ← k − d

We compared 4 heuristic selection functions:

◮ min: minimum value selection function (for k not divisible, divisibility tests & exact divisions reduce k − 1 and k + 1 to k′ and k″, and the minimum selects d = +1 or d = −1)
◮ approx: approximated minimum value selection function
◮ max nb div: maximum number of divisors selection function
◮ min2: 2 steps minimum value selection function
  1) (k − 1, k + 1) → min → (k′, k″)
  2) (k′ − 1, k′ + 1, k″ − 1, k″ + 1) → min2 → (ζ′, ζ″, ζ‴, ζ⁗)
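A software sketch of the signed recoding with the min selection function is given below. It assumes that 2 belongs to the multi-base (so that k ± 1 is always divisible when k is not) and breaks ties in favor of d = +1; both points, as well as the function names, are assumptions made for this illustration.

```python
def strip_bases(k, bases):
    """Divide k by every base element as long as possible; return (k', exponents)."""
    exps = []
    for b in bases:
        e = 0
        while k % b == 0:
            e, k = e + 1, k // b
        exps.append(e)
    return k, tuple(exps)

def select_min(k, bases):
    """min heuristic: pick the digit whose reduced value k' or k'' is the smallest."""
    k_minus, _ = strip_bases(k - 1, bases)    # candidate for d = +1
    k_plus, _ = strip_bases(k + 1, bases)     # candidate for d = -1
    return 1 if k_minus <= k_plus else -1

def signed_recode(k, bases=(2, 3, 5, 7)):
    assert 2 in bases, "sketch assumes 2 is a base element"
    terms = []
    while k > 1:
        if all(k % b for b in bases):         # k not divisible by any base element
            d = select_min(k, bases)
            k -= d
        else:
            d = 0
        k, exps = strip_bases(k, bases)
        terms.append((d, exps))
    return terms

def signed_value(terms, bases=(2, 3, 5, 7)):
    """Evaluate the Horner form from the innermost value 1."""
    value = 1
    for d, exps in reversed(terms):
        prod = 1
        for b, e in zip(bases, exps):
            prod *= b ** e
        value = d + prod * value
    return value

print(signed_recode(1001), signed_value(signed_recode(1001)))
```

Computing both reduced values k′ and k″ doubles the divisibility tests and exact divisions per step, which is the cost the approx heuristic tries to avoid.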


approx Selection Function

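[Figure: selection function S: for k not divisible, k − 1 and k + 1 go through divisibility tests & exact divisions, giving k′ and k″, from which d = +1 or d = −1 is selected]

Computing (k′, k″) is expensive, so we try to get an approximation:

$k' \approx \delta' = \underbrace{\lfloor \log_2(k - 1) \rfloor + 1}_{\text{MSB position of } k-1} - \sum_{j=1}^{l} e'_j \log_2(b_j)$

$k'' \approx \delta'' = \underbrace{\lfloor \log_2(k + 1) \rfloor + 1}_{\text{MSB position of } k+1} - \sum_{j=1}^{l} e''_j \log_2(b_j)$

1) Exponents $e'_j$ and $e''_j$ are produced by the divisibility tests
2) Approximate constants: log2 3 ≈ 1.58, log2 5 ≈ 2.32, and log2 7 ≈ 2.81

$\delta' = \mathrm{MSB}(k - 1) - e_{b_1=2} - 1.5\, e_{b_2=3} - 2.25\, e_{b_3=5} - 2.75\, e_{b_4=7}$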
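The approximated selection function can be sketched as follows, using the rounded constants 1, 1.5, 2.25 and 2.75 from the last formula. The exponents are obtained here by trial division for simplicity (the hardware gets them from the divisibility tests), ties are resolved in favor of d = +1, and the names `delta` and `select_approx` are hypothetical.

```python
WEIGHTS = {2: 1.0, 3: 1.5, 5: 2.25, 7: 2.75}   # rounded log2(b_j) values from the slide

def exponents(k, bases):
    """Multiplicity of each base element in k."""
    exps = []
    for b in bases:
        e = 0
        while k % b == 0:
            e, k = e + 1, k // b
        exps.append(e)
    return exps

def delta(k, bases=(2, 3, 5, 7)):
    """MSB position of k minus the weighted exponents of its base divisors."""
    msb = k.bit_length()                        # floor(log2 k) + 1
    return msb - sum(e * WEIGHTS[b] for b, e in zip(bases, exponents(k, bases)))

def select_approx(k, bases=(2, 3, 5, 7)):
    """Return d = +1 if k - 1 looks smaller after reduction, else d = -1."""
    return 1 if delta(k - 1, bases) <= delta(k + 1, bases) else -1

print(delta(10), delta(12), select_approx(11))
```

The weights are multiples of 1/4, so the comparison only needs a few fractional bits and no exact division is performed before d is chosen.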


Comparison of Selection Functions

For curves over Fp with a = −3:

[Figure: computation time [M] (1600 to 2200) for the multi-bases (2,7), (2,5), (2,3), (2,5,7), (2,3,7), (2,3,5), (2,3,5,7) and (2,3,5,7,11), comparing the unsigned, signed/max_nb_div, signed/min, signed/approx and signed/min2 recodings]

Average computation time (in M) of 10 000 scalar multiplications with 160-bit values Similar behavior for curves with a = −3


SLIDE 9

Complete FPGA Implementation Results

Signed recoding unit with approx heuristic:

w    area: slices (FF/LUT)   freq. (MHz)
12   173 (326/466)           232
24   345 (724/1 005)         202

ECC processor (modification from [10], collab. UCC crypto group):

version   memory type   area: slices (FF/LUT)   BRAM   freq. (MHz)
small     distributed   2 204 (3 971/6 816)     –      155
small     BRAM          1 793 (3 641/6 182)     6      155
large     distributed   3 182 (4 668/7 361)     –      142
large     BRAM          2 427 (4 297/6 981)     6      142

small: Fp curves, n = 160, Jacob. coord., NAF/MBNS, 1 unit/op.
large: same with 4NAF/MBNS and 2 ±, 2 ×, 1 inv.


Timings Comparison

For n = 160 and a = −3:

refs.   methods              perfs       pre-comp. storage   pre-comp. operations   recoding
        dbl&add              1 985.3M    ∅                   ∅                      ∅
        NAF                  1 723.0M    ∅                   ∅                      on-the-fly & very cheap
        3NAF                 1 583.7M    1 pt                49.4M                  on-the-fly & very cheap
        4NAF                 1 499.1M    3 pts               140.8M                 on-the-fly & very cheap
[14]    DBNS                 1 863.0M    ∅                   ∅                      off-line & costly
[15]    DBNS                 1 722.3M    ∅                   ∅                      off-line & costly
[1]     DBNS                 1 558.4M    7 pts               >150M                  off-line & costly
[16]    DBNS                 1 615.3M    ∅                   ∅                      off-line & costly
our     (2, 3) MBNS          1 746.2M    ∅                   ∅                      on-the-fly & small
our     (2, 3, 5) MBNS       1 679.9M    ∅                   ∅                      on-the-fly & small
our     (2, 3, 5, 7) MBNS    1 670.4M    ∅                   ∅                      on-the-fly & small

For n = 160 and a = −3: about 15% slower than best DBNS/MBNS (theoretical) solutions


Addition Chains Recoding Work presented at Compas 2015

[34] J. Proy, N. Veyrat-Charvillon, A. Tisserand, and N. Meloni. Full hardware implementation of short addition chains recoding for ECC scalar multiplication. In Actes Conférence d’informatique en Parallélisme, Architecture et Système (ComPAS), Lille, France, June 2015. http://tel.archives-ouvertes.fr/hal-01171095 (PDF)


Addition Chain(s)

Definition

An addition chain for k ∈ N is a sequence of integers $(a_0, a_1, \ldots, a_l)$ satisfying: $a_0 = 1$, $a_l = k$, and $a_i = a_{i_1} + a_{i_2}$ for some $i_2 \le i_1 < i$

Definition

A euclidean addition chain (EAC) satisfies the additional condition: for i ≥ 3, $a_i = a_{i_1} + a_{i_2} \Rightarrow a_{i+1} = a_i + a_{i_1}$ or $a_{i+1} = a_i + a_{i_2}$

Example: for k = 14, possible EACs are (time →)

current terms:   1   2   3   4   5   9   14
added summands:      1   1   1   1   4   5

current terms:   1   2   3   5   8   11   14
added summands:      1   1   2   3   3    3

[Figure: each chain is shown with its step codes]


SLIDE 10

Scalar Multiplication using EACs

EAC coding of k ∈ N: $C = \mathrm{EAC}(k) = (c_l c_{l-1} \ldots c_1 c_0)$

◮ largest summand is added ⇒ c_i = 0 (big step)
◮ smallest summand is added ⇒ c_i = 1 (small step)

EAC based scalar multiplication algorithm:

EAC Scalar Multiplication: C = EAC(k), point P

(U1, U2) ← (P, [2]P)
for c_i ∈ C do
    if c_i = 0 then (U1, U2) ← (U2, U1 + U2)     (big step)
    else (U1, U2) ← (U1, U1 + U2)                (small step)
return Q = U1 + U2                               (Q = [k]P)

EAC: only ADD is used ⇒ natural protection against SPA (simple power analysis, see [26])
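The EAC loop can be replayed on a stand-in group to see how a step code drives the computation. The sketch assumes that the point [m]P is modeled by the integer m and that the list `code` holds the steps in processing order; the function name and the sample codes for k = 14 are illustrative.

```python
def eac_scalar_mult(code, P, add, dbl):
    """EAC scalar multiplication: one ADD per step, whatever the step value."""
    U1, U2 = P, dbl(P)
    for c in code:
        if c == 0:
            U1, U2 = U2, add(U1, U2)    # big step: keep the largest summand
        else:
            U1, U2 = U1, add(U1, U2)    # small step: keep the smallest summand
    return add(U1, U2)

int_add = lambda a, b: a + b            # stand-in for point addition
int_dbl = lambda a: 2 * a               # stand-in for the initial [2]P

# Two step codes for k = 14, matching the chains 1 2 3 4 5 9 14 and 1 2 3 5 8 11 14.
print(eac_scalar_mult([1, 1, 0, 0], 1, int_add, int_dbl))
print(eac_scalar_mult([0, 0, 1, 1], 1, int_add, int_dbl))
```

Both branches execute exactly one addition, so the operation sequence does not depend on the code values.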


State-of-the-Art Scalar Multiplication Algorithms

Algorithms                                                          SPA resilient
Double and Add (DA): basic algorithm                                NO
Non-Adjacent Forms (NAF): recoding for faster schemes, see [19]     NO
Montgomery Ladder (ML): regular algorithm [28]                      YES
Unified Formulas (UF): same formula for ADD and DBL [9]             YES
...

Average cost per key/chain bit for several scalar multiplication methods:

Method         DA     NAF-3   NAF-4   ML     UF     EAC
Source         [19]   [19]    [19]    [28]   [9]    [27]
Cost (m/bit)   17     13.5    12.8    24     19     7

Warning: n = length(k) and l = length(C) are different!
EAC is efficient for l ≤ 2n


Short EAC Recoding Algorithm

Computing an EAC is easy but computing a short one is very hard!

Apply the subtractive version of Euclid’s algorithm for (k, g) where gcd(k, g) = 1 and g well chosen (it strongly impacts l)

Random choice leads to $l \approx O(\ln(k)^2)$ which is too slow (see [28])

Random search process in range 2ε around $g = k/\varphi$ (with $\varphi = \frac{1 + \sqrt{5}}{2}$):

1. compute k/φ (mult. by cst. 1/φ)
2. get m, the minimal number of starting big steps (in a table); m is the minimum number of big steps at the end of C
3. in parallel:
   ◮ start EAC scalar multiplication with m big steps
   ◮ try all g in interval [k/φ − ε, k/φ + ε] to find a short EAC of k
4. finish scalar multiplication using the shortest EAC found
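The search for a short chain can be sketched in software. The code below runs the subtractive Euclid algorithm on (k − g, g), records the steps backwards, and keeps the shortest code over all admissible g in [k/φ − ε, k/φ + ε]. It is a sequential illustration under stated assumptions: the parallel start with m big steps is not modeled, floor(k/φ) is computed as (isqrt(5k²) − k) // 2, and the function names are hypothetical.

```python
from math import gcd, isqrt

def eac_from(k, g):
    """Step code (processing order) ending on the pair (k - g, g); needs k/2 < g < k, gcd(k, g) = 1."""
    u1, u2 = k - g, g
    code = []
    while (u1, u2) != (1, 2):
        diff = u2 - u1
        if diff < u1:
            code.append(0)              # this pair was produced by a big step
            u1, u2 = diff, u1
        else:
            code.append(1)              # this pair was produced by a small step
            u2 = diff
    code.reverse()
    return code

def replay(code):
    """Integer replay of the EAC scalar multiplication: returns the encoded k."""
    u1, u2 = 1, 2
    for c in code:
        u1, u2 = (u2, u1 + u2) if c == 0 else (u1, u1 + u2)
    return u1 + u2

def short_eac(k, eps):
    """Shortest code found for g in [k/phi - eps, k/phi + eps] (None if no valid g)."""
    g0 = (isqrt(5 * k * k) - k) // 2    # floor(k / phi) with integer arithmetic only
    best = None
    for g in range(g0 - eps, g0 + eps + 1):
        if k < 2 * g < 2 * k and gcd(k, g) == 1:
            code = eac_from(k, g)
            if best is None or len(code) < len(best):
                best = code
    return best

code = short_eac(1000003, 20)
print(len(code), replay(code) == 1000003)
```

A badly chosen g can make the subtractive algorithm run for a very large number of steps, which is why the search is restricted to a neighborhood of k/φ.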


Statistical Analysis of EAC Recodings

Simulations for measuring l/n for g ∈ [k/φ − ε, k/φ + ε] and various ε values (averaged over 1000 random scalars)

[Figure: top panels: len(EAC) [bits] (300 to 1000) versus scalar length [bits] (200 to 400) for ε ≤ 10, ε ≤ 100 and ε ≤ 1000, each compared with the line y = 2x; bottom panel: len(EAC) / len(k) (1.8 to 2.3) versus scalar length [bits] (200 to 500) for ε ≤ 10, 50, 100, 200, 500 and 1000, with the speed-up zone marked]


SLIDE 11

Proposed EAC Recoding Unit

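[Figure: EAC recoding unit: a memory (BRAM) stores 1/φ, k, k/φ, a, b, C, C′, each addressed through its own offset (offset k, offset k/φ, offset a, offset b, offset C, offset C′) by an address controller (CTRL @); a "computation of k/φ" block (±, +, + ε); a "Euclid computation of C" block computing a(j) − b(j), b(j) − a(j), a(j)/2 − b(j) and b(j)/2 − a(j) with carry outputs (cout) and the LSBs of a(j) and b(j); a CTRL C block and a SIPO register collecting C; legend: write ports, read ports, address, control signals, scalar word digit, w-bit data word]

w = 32 bits in this work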


FPGA Implementation: Recoding Unit

                                        optimization: area               optimization: speed
FPGA       Rec. meth.   n     BRAM      slices (FF/LUT)   freq. (MHz)    slices (FF/LUT)   freq. (MHz)
Spartan6   EAC          160   1         209 (636/441)     151            211 (662/476)     151
Spartan6   EAC          192   1         195 (641/441)     160            223 (634/476)     157
Spartan6   EAC          256   1         203 (636/441)     155            214 (662/476)     154
Spartan6   EAC          384   1         228 (652/441)     154            215 (682/476)     159
Spartan6   Bin. & ML    160   1         26 (70/101)       466            73 (194/237)      388
Spartan6   Bin. & ML    192   1         26 (70/101)       466            75 (231/270)      387
Spartan6   Bin. & ML    256   1         26 (70/101)       466            94 (300/336)      377
Spartan6   Bin. & ML    384   1         26 (70/101)       466            128 (446/475)     379
Spartan6   NAF-3        160   1         35 (104/122)      328            34 (108/130)      382
Spartan6   NAF-3        192   1         35 (104/122)      328            34 (108/130)      382
Spartan6   NAF-3        256   1         35 (104/122)      322            34 (108/130)      364
Spartan6   NAF-3        384   1         42 (120/123)      248            43 (123/131)      332
Spartan6   NAF-4        160   1         40 (109/125)      333            36 (113/134)      388
Spartan6   NAF-4        192   1         40 (109/125)      333            36 (113/134)      388
Spartan6   NAF-4        256   1         40 (109/125)      333            36 (113/134)      365
Spartan6   NAF-4        384   1         45 (129/126)      236            43 (132/135)      320
Virtex5    EAC          160   1         278 (711/509)     198            276 (709/476)     206
Virtex5    NAF-4        160   1         60 (148/161)      418            61 (147/157)      413
Virtex5    MBNS [11]    160   1         153 (301/412)     232            323 (682/908)     202

MBNS [11]: n/12 + 4 clock cycles (first result), n/24 + 4 clock cycles (second result)


FPGA Implementation: Complete Crypto-Processor

[Figure: complete crypto-processor: arithmetic units AU1, AU2, AU3, points memory (address, coord.), control (CTRL), program memory (instruction address, 21-bit instruction), recoding unit (input k, output k_i); legend: address, control signals, scalar word digit, w-bit data word]

◮ n = 192 bits
◮ Spartan-6 FPGA
◮ Small configuration (1 mult., 1 add., 1 inv.)

recoding method   BRAM   optim. target   area: slices (FF/LUT)   freq. (MHz)   dura. (ms)   SCA prot.
EAC               3      area            534 (1813/1508)         132           35.8         Y
EAC               3      speed           556 (1872/1523)         137           34.5         Y
DA                2      area            429 (1243/1134)         191           30           N
DA                2      speed           399 (1302/1222)         177           32.5         N
ML                2      area            429 (1243/1134)         191           42.5         Y
ML                2      speed           399 (1302/1222)         177           45.8         Y
UF                2      area            429 (1243/1134)         191           50.4         Y
UF                2      speed           399 (1302/1222)         177           54.4         Y
NAF-3             2      area            422 (1280/1157)         181           25.2         N
NAF-3             2      speed           423 (1321/1242)         175           26.1         N
NAF-4             2      area            420 (1277/1161)         158           27.3         N
NAF-4             2      speed           425 (1233/1246)         177           24.4         N


Hardware Implementation of RNS for ECC (1/2)

RNS: Residue Number System

Base $B = (m_1, m_2, \ldots, m_k)$ of k relatively prime moduli
Size of the base: k

$A = \{a_1, a_2, \ldots, a_k\}$, $\forall i\ a_i = A \bmod m_i$

Operations:
$A \pm B = (|a_1 \pm b_1|_{m_1}, \ldots, |a_k \pm b_k|_{m_k})$
$A \times B = (|a_1 \times b_1|_{m_1}, \ldots, |a_k \times b_k|_{m_k})$
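The channel-wise nature of RNS operations can be illustrated with a toy base. The sketch below assumes the small base (13, 15, 16, 17), chosen only for readability, and uses the Chinese remainder theorem to convert back to a standard integer; the function names are hypothetical, and real ECC implementations use wide moduli and avoid this conversion inside the computation.

```python
from math import prod

def to_rns(x, base):
    """Residues of x for each modulus of the base."""
    return tuple(x % m for m in base)

def rns_add(a, b, base):
    return tuple((x + y) % m for x, y, m in zip(a, b, base))

def rns_mul(a, b, base):
    return tuple((x * y) % m for x, y, m in zip(a, b, base))

def from_rns(residues, base):
    """Chinese remainder theorem reconstruction, valid for values below prod(base)."""
    M = prod(base)
    x = 0
    for r, m in zip(residues, base):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)
    return x % M

BASE = (13, 15, 16, 17)                 # pairwise co-prime, product 53040
a, b = to_rns(123, BASE), to_rns(321, BASE)
print(from_rns(rns_add(a, b, BASE), BASE), from_rns(rns_mul(a, b, BASE), BASE))
```

Each channel works on its own small residue without any carry propagation between channels, which is what makes the representation attractive for parallel hardware.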


SLIDE 12

Hardware Implementation of RNS for ECC (2/2)

[Figure: RNS architecture: n channels (Rower 1, Rower 2, ..., Rower n), each with w-bit data paths and a mod 3 output; a Cox unit; registers and w-bit I/O; a global CTRL (30-state FSM) and a shared local CTRL (register address, enable, read/write); each rower contains an arithmetic unit with 6 pipeline stages (signals rst, mode, ...), w-bit IN and OUT ports, a comparator, and precomputed tables (precomp. mult. ≈ 2n × w, precomp. r_i (×2), precomp. add. 38 × w)]

Optimized algorithms and implementations for Fp operations:

◮ modular inversion: 10× speedup [3, 6]
◮ modular multiplication: 1/2 × area-delay cost [5]
◮ operations patterns [4]
◮ PhD thesis Karim Bigou [2]
◮ hybrid position-residues (HPR) representation: 1/2 × area-delay cost [7]


PAVOIS Integrated Circuit

ECC 256 bits, 65 nm CMOS, 1.5 mm²


Our Long Term Objectives

Study the links between:

cryptosystems
arithmetic algorithms
Fq, pts representations
architectures & units
circuit optimizations

to ensure

high security against

◮ theoretical attacks
◮ physical attacks

low design cost
low silicon cost
low energy(/power)
high performances
high flexibility

[Figure: relative costs: area from 1 to 1 + a, delay from 1 to 1 + t, energy from 1 to 1 + e, with a, t, e ∈ {0%, 5%, 10%, ..., 100%}, versus security levels 1, ×10, ×100]


References I

[1] D. J. Bernstein, P. Birkner, T. Lange, and C. Peters. Optimizing double-base elliptic-curve single-scalar multiplication. In Proc. 8th International Conference on Progress in Cryptology (INDOCRYPT), volume 4859 of LNCS, pages 167–182, Chennai, India, December 2007. Springer.

[2] K. Bigou. Étude théorique et implantation matérielle d’unités de calcul en représentation modulaire des nombres pour la cryptographie sur courbes elliptiques. PhD thesis, University Rennes 1, Lannion, France, November 2014.

[3] K. Bigou and A. Tisserand. Improving modular inversion in RNS using the plus-minus method. In G. Bertoni and J.-S. Coron, editors, Proc. 15th International Workshop on Cryptographic Hardware and Embedded Systems (CHES), volume 8086 of LNCS, pages 233–249, Santa Barbara, CA, USA, August 2013. Springer.

[4] K. Bigou and A. Tisserand. RNS modular multiplication through reduced base extensions. In H. Fu and D. Thomas, editors, Proc. 25th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 57–62, Zurich, Switzerland, June 2014. IEEE.

[5] K. Bigou and A. Tisserand. Single base modular multiplication for efficient hardware RNS implementations of ECC. In T. Guneysu and H. Handschuh, editors, Proc. 17th International Workshop on Cryptographic Hardware and Embedded Systems (CHES), volume 9293 of LNCS, pages 123–140, Saint-Malo, France, September 2015. Springer.

[6] K. Bigou and A. Tisserand. Binary-ternary plus-minus modular inversion in RNS. IEEE Transactions on Computers, 65(11):3495–3501, November 2016.

[7] K. Bigou and A. Tisserand. Hybrid position-residues number system. In J. Hormigo, S. Oberman, and N. Revol, editors, Proc. 23rd Symposium on Computer Arithmetic (ARITH), pages 126–133, Santa Clara, CA, U.S.A, July 2016. IEEE Computer Society.

SLIDE 13

References II

[8] N. Boullis and A. Tisserand. Some optimizations of hardware multiplication by constant matrices. IEEE Transactions on Computers, 54(10):1271–1282, October 2005.

[9] E. Brier, M. Joye, and I. Dechene. Embedded Cryptographic Hardware, chapter Unified Point Addition Formulae for Elliptic Curve Cryptosystems, pages 247–256. Nova Science, 2004.

[10] A. Byrne, E. Popovici, and W.P. Marnane. Versatile processor for GF(p^m) arithmetic for use in cryptographic applications. IET Computers & Digital Techniques, 2(4):253–264, July 2008.

[11] T. Chabrier and A. Tisserand. On-the-fly multi-base recoding for ECC scalar multiplication without pre-computations. In A. Nannarelli, P.-M. Seidel, and P. T. P. Tang, editors, Proc. 21st Symposium on Computer Arithmetic (ARITH), pages 219–228, Austin, TX, U.S.A, April 2013. IEEE Computer Society.

[12] J. Chen, A. Tisserand, E. M. Popovici, and S. Cotofana. Robust sub-powered asynchronous logic. In J. Becker and M. R. Adrover, editors, Proc. 24th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), pages 1–7, Palma de Mallorca, Spain, September 2014. IEEE.

[13] J. Chen, A. Tisserand, E. M. Popovici, and S. Cotofana. Asynchronous charge sharing power consistent Montgomery multiplier. In J. Sparso and E. Yahya, editors, Proc. 21st IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC), pages 132–138, Mountain View, California, USA, May 2015.

[14] V. Dimitrov, L. Imbert, and P. K. Mishra. Efficient and secure elliptic curve point multiplication using double-base chains. In Proc. 11th International Conference on the Theory and Application of Cryptology and Information Security (ASIACRYPT), volume 3788 of LNCS, pages 59–78, Chennai, India, December 2005. Springer.

References III

[15]

V. Dimitrov, L. Imbert, and P. K. Mishra.

The double-base number system and its application to elliptic curve cryptography. Mathematics of Computation, 77(262):1075–1104, April 2008.

[16]

C. Doche and L. Imbert.

Extended double-base number system with applications to elliptic curve cryptography. In Proc. 7th International Conference on Progress in Cryptology (INDOCRYPT), volume 4329 of LNCS, pages 335–348, Kolkata, India, December 2006. Springer.

[17]

G. Gallin and A. Tisserand.

Hardware architectures for HECC. 15th International Workshop on Cryptographic Architectures Embedded in Reconfigurable Devices (CryptArchi), Smolenice, Slovakia, June 2017.

[18]

P. Giorgi, L. Imbert, and T. Izard.

Optimizing elliptic curve scalar multiplication for small scalars. In Proc. Mathematics for Signal and Information Processing, volume 7444, pages 74440N:1–10, San Diego, CA, USA, September 2009. SPIE.

[19]

D. Hankerson, A. Menezes, and S. Vanstone.

Guide to Elliptic Curve Cryptography. Springer, 2004.

[20]

K. Itoh, M. Takenaka, N. Torii, S. Temma, and Y. Kurihara.

Fast implementation of public-key cryptography on a DSP TMS320C6201. In Proc. Cryptographic Hardware and Embedded Systems (CHES), volume 1717 of LNCS, pages 61–72, Worcester, MA, USA, August 1999. Springer.

[21]

T. Jebelean.

An algorithm for exact division. Journal of Symbolic Computation, 15(2):169–180, February 1993.

References IV

[22]

P. Longa and C. Gebotys.

Setting speed records with the (fractional) multibase non-adjacent form method for efficient elliptic curve scalar multiplication. Technical Report 118, Cryptology ePrint Archive, 2008.

[23]

P. Longa and C. Gebotys.

Fast multibase methods and other several optimizations for elliptic curve scalar multiplication. In Proc. Public Key Cryptography (PKC), volume 5443 of LNCS, pages 443–462, 2009.

[24]

P. Longa and A. Miri.

New multibase non-adjacent form scalar multiplication and its application to elliptic curve cryptosystems. Technical Report 52, Cryptology ePrint Archive, 2008.

[25]

Y. Ma, Z. Liu, W. Pan, and J. Jing.

A high-speed elliptic curve cryptographic processor for generic curves over GF(p). In Proc. 20th International Workshop on Selected Areas in Cryptography (SAC), volume 8282 of LNCS, pages 421–437, Burnaby, BC, Canada, August 2013. Springer.

[26]

S. Mangard, E. Oswald, and T. Popp.

Power Analysis Attacks: Revealing the Secrets of Smart Cards. Springer, 2007.

[27]

N. Meloni.

New point addition formulae for ECC applications. In Proc. 1st International Workshop on Arithmetic of Finite Fields (WAIFI), volume 4547 of LNCS, pages 189–201, Madrid, Spain, June 2007. Springer.

[28]

P. L. Montgomery.

Speeding the Pollard and elliptic curve methods of factorization. Mathematics of Computation, 48(177):243–264, January 1987.

[29]

D. Pamula.

Arithmetic Operators on GF(2^m) for Cryptographic Applications: Performance - Power Consumption - Security Tradeoffs. PhD thesis, University of Rennes 1 and Silesian University of Technology, December 2012.

References V

[30]

D. Pamula, E. Hrynkiewicz, and A. Tisserand.

Analysis of GF(2^233) multipliers regarding elliptic curve cryptosystem applications. In 11th IFAC/IEEE International Conference on Programmable Devices and Embedded Systems (PDeS), pages 271–276, Brno, Czech Republic, May 2012.

[31]

D. Pamula and A. Tisserand.

GF(2^m) finite-field multipliers with reduced activity variations. In 4th International Workshop on the Arithmetic of Finite Fields, volume 7369 of LNCS, pages 152–167, Bochum, Germany, July 2012. Springer.

[32]

D. Pamula and A. Tisserand.

Fast and secure finite field multipliers. In Proc. 18th Euromicro Conference on Digital System Design (DSD), pages 653–660, Madeira, Portugal, August 2015.

[33]

B. Pascal.

Œuvres complètes, volume 5, chapter De Numeribus Multiplicibus, pages 117–128. Librairie Lefèvre, 1819.

[34]

J. Proy, N. Veyrat-Charvillon, A. Tisserand, and N. Meloni.

Full hardware implementation of short addition chains recoding for ECC scalar multiplication. In Actes Conférence d’informatique en Parallélisme, Architecture et Système (ComPAS), Lille, France, June 2015.

[35]

J. Sakarovitch.

Elements of Automata Theory, chapter Prologue: M. Pascal’s Division Machine, pages 1–6. Cambridge, 2009.

SLIDE 14

The end. Questions?

Contact:

mailto:arnaud.tisserand@univ-ubs.fr
http://www-labsticc.univ-ubs.fr/~tisseran
CNRS, Lab-STICC Laboratory

University South Brittany (UBS), Centre de recherche C. Huygens, rue St Maudé, BP 92116, 56321 Lorient cedex, France

Thank you
