SLIDE 1

Pushing the Limits of High-Speed $GF(2^m)$ Elliptic Curve Scalar Multiplier on FPGAs

Chester Rebeiro, Sujoy Sinha Roy, and Debdeep Mukhopadhyay

Secured Embedded Architecture Lab, Indian Institute of Technology Kharagpur, India

9/12/2012, CHES 2012, Leuven, Belgium

SLIDE 2

Elliptic Curve Scalar Multiplication

An elliptic curve over $GF(2^m)$ is the set of points that satisfy the equation
$y^2 + xy = x^3 + ax^2 + b$, where $a, b \in GF(2^m)$ and $b \neq 0$. The points on the elliptic curve form an additive group.

The projective coordinate representation of the curve is

$Y^2 + XYZ = X^3 + aX^2Z^2 + bZ^4$

Scalar Multiplication:

Given a base point $P = (X_P, Y_P, Z_P)$ on the elliptic curve and a scalar $s$, compute $Q = sP$
(i.e. $Q = P + P + \cdots + P$, $s$ times)

SLIDE 4

Montgomery Ladder for Scalar Multiplication

Inputs: scalar $s = (s_{t-1} s_{t-2} \cdots s_1 s_0)_2$, base point $P$
Output: scalar product $Q = sP$

1. $P_1 = (X_1, Y_1, Z_1) \leftarrow P$ and $P_2 = (X_2, Y_2, Z_2) \leftarrow 2 \cdot P$
2. For each bit $s_k$ (for $k = t-2, t-3, \cdots, 0$)
   if $s_k = 1$ then $P_1 \leftarrow P_1 + P_2$; $P_2 \leftarrow 2 \cdot P_2$
   if $s_k = 0$ then $P_2 \leftarrow P_1 + P_2$; $P_1 \leftarrow 2 \cdot P_1$
3. $Q = \mathrm{Projective2Affine}(P_1)$

Performing $P_i \leftarrow P_i + P_j$ and $P_j \leftarrow 2 \cdot P_j$:

$X_i \leftarrow X_i \cdot Z_j$; $Z_i \leftarrow X_j \cdot Z_i$; $T \leftarrow X_j$; $X_j \leftarrow X_j^4 + b \cdot Z_j^4$
$Z_j \leftarrow (T \cdot Z_j)^2$; $T \leftarrow X_i \cdot Z_i$; $Z_i \leftarrow (X_i + Z_i)^2$; $X_i \leftarrow x \cdot Z_i + T$

... all operations are in $GF(2^m)$
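The control flow of the ladder can be sketched in Python. This is only an illustration of the loop structure, not of the projective field arithmetic on the slide: as a stand-in group it uses the integers under ordinary addition (so sP is simply s times P), and the names `montgomery_ladder`, `int_add`, and `int_double` are hypothetical. The point it makes is that every scalar bit costs exactly one addition and one doubling, whatever the bit value.

```python
def montgomery_ladder(s, P, add, double):
    """Return s*P (s >= 1) using the Montgomery ladder.

    `add` and `double` are the group operations, passed in so the same
    control flow works for any additive group (hypothetical interface).
    """
    if s < 1:
        raise ValueError("scalar must be at least 1")
    bits = bin(s)[2:]          # s_{t-1} ... s_0, where s_{t-1} = 1
    P1, P2 = P, double(P)      # step 1: P1 <- P, P2 <- 2P
    for bit in bits[1:]:       # step 2: k = t-2 down to 0
        if bit == "1":
            P1, P2 = add(P1, P2), double(P2)
        else:
            P1, P2 = double(P1), add(P1, P2)
    return P1                  # invariant kept throughout: P2 - P1 = P


# Stand-in group (an assumption for illustration): integers under addition.
def int_add(a, b):
    return a + b


def int_double(a):
    return 2 * a


result = montgomery_ladder(0b101101, 7, int_add, int_double)
```

Replacing `int_add` and `int_double` with elliptic curve point addition and doubling would give the scalar multiplication itself; the loop body does not change.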

SLIDE 6

Engineering the Montgomery Ladder for Scalar Multiplication

(a) The ECC Pyramid [figure: layers labeled Scalar Multiplication, Elliptic Curve Group Operations, Finite Field Operations]

(b) Block Diagram [figure: Control Unit, ROM, Regbank, and Arithmetic Unit, with input s and output sP]

High-speed scalar multiplication on FPGAs

Minimize area by maximizing utilization of available resources
Optimal Pipelining
Efficient Scheduling of Operations

SLIDE 8

Field Programmable Gate Arrays

Provides the speed of hardware and the reconfigurability of software

FPGA Architecture

(a) FPGA Island [figure: logic blocks, routing switches, programmable connections, and programmable switches]

(b) Lookup Table [figure: LUT & Carry Logic and Control; signal labels F1 to F4, CIN, COUT, CLK, CE, SR, BY, PRE, CLR, D, Q]

SLIDE 10

LUT Utilization

LUT

Four (or six) input → one output
Can implement any four (or six) input truth table
$y_1 = x_1 \oplus x_2 \oplus x_3 \oplus x_4$

Requires one LUT.

$y_2 = x_1 \oplus x_2$

Still requires one LUT.

$y_2$ results in an underutilized LUT.

... need to maximize LUT utilization to minimize area.

SLIDE 14

Finite field Multiplier for Best LUT utilization

(a) Karatsuba-Ofman Multiplication [figure: a 233-bit Karatsuba multiplier split recursively into 116/117-bit, 58/59-bit, 29/30-bit, 14/15-bit, and finally 7/8-bit multipliers]

(b) Hybrid Karatsuba Multiplication [figure: the same 233-bit split tree, with Karatsuba multipliers at the upper levels and classical multipliers at the smallest level]

(c) Finding the Right Threshold [figure: LUTs versus threshold, threshold axis from 4 to 22]

(d) Comparing Multipliers [figure: Area * Time versus number of bits for Karatsuba-Ofman and Hybrid Karatsuba]

[VLSID 2008]
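A minimal software sketch of the hybrid idea, assuming polynomials over GF(2) are stored as Python integers (bit i is the coefficient of x^i): operands are split Karatsuba-style until they are no wider than a threshold, and a classical shift-and-XOR multiplier handles the rest. The function names and the default threshold of 11 are assumptions made for illustration; the slides describe a LUT-level hardware design, and this sketch only mirrors its recursion. The result is the unreduced product, so modular reduction is a separate step.

```python
import random


def classical_mul(a, b):
    """Schoolbook (shift-and-XOR) product of two GF(2) polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def hybrid_karatsuba(a, b, nbits, threshold=11):
    """Unreduced product of two GF(2) polynomials of at most `nbits` bits.

    Above `threshold` bits the operands are split in two (Karatsuba);
    at or below it the classical multiplier is used.
    """
    if nbits <= max(threshold, 1):
        return classical_mul(a, b)
    half = (nbits + 1) // 2                 # width of the low part
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    low = hybrid_karatsuba(a_lo, b_lo, half, threshold)
    high = hybrid_karatsuba(a_hi, b_hi, nbits - half, threshold)
    mid = hybrid_karatsuba(a_lo ^ a_hi, b_lo ^ b_hi, half, threshold)
    # In GF(2), subtraction is XOR: cross term = mid - low - high.
    return low ^ ((mid ^ low ^ high) << half) ^ (high << (2 * half))


rng = random.Random(2012)
x = rng.getrandbits(233)
y = rng.getrandbits(233)
product = hybrid_karatsuba(x, y, 233)
```

With 233-bit operands this recursion produces the 116/117, 58/59, 29/30, and 14/15 splits shown in the figure before falling back to the classical multiplier.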

SLIDE 16

Finite Field Inversion Using Itoh-Tsujii Algorithm

Given $a \in GF(2^m)$, find $a^{-1} \in GF(2^m)$ such that $a \cdot a^{-1} = 1$
Fermat's Little Theorem: $a^{-1} = a^{2^m - 2}$
Itoh-Tsujii Algorithm

1. Define the addition chain for $m - 1$
   (for example $m = 233$: (1, 2, 3, 6, 7, 14, 28, 29, 58, 116, 232))
2. Compute
   $a \rightarrow a^{2^2-1} \rightarrow a^{2^3-1} \rightarrow a^{2^6-1} \rightarrow a^{2^7-1} \rightarrow a^{2^{14}-1} \cdots \rightarrow a^{2^{232}-1}$
3. Square to get $a^{2^{233}-2}$

Exponentiation requires a series of cascaded squarers, called a powerblock, along with a finite field multiplier

[Figure: powerblock with cascaded squarer circuits (Circuit-1, Circuit-2, Circuit-3, ..., Circuit-11) between Input and Output, and a multiplexer controlled by qsel]
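The three steps can be sketched in software as follows. Two things here are assumptions rather than content of the slides: the reduction polynomial x^233 + x^74 + 1 (the NIST trinomial for m = 233; the slides do not name one), and the addition chain, which includes 29 (= 28 + 1) so that every element is the sum of two earlier elements. Writing $\beta_k = a^{2^k - 1}$, each chain step uses $\beta_{i+j} = (\beta_i)^{2^j} \cdot \beta_j$. Field elements are Python integers, and the function names are illustrative.

```python
M = 233
# Assumed reduction polynomial: x^233 + x^74 + 1.
POLY = (1 << 233) | (1 << 74) | 1
# Addition chain for m - 1 = 232; each element is a sum of two earlier ones.
CHAIN = [1, 2, 3, 6, 7, 14, 28, 29, 58, 116, 232]


def gf_mul(a, b):
    """Multiply two elements of GF(2^233), reducing as we go."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:          # degree reached 233: subtract (XOR) the modulus
            a ^= POLY
    return result


def gf_pow2k(a, k):
    """Square `a` k times, i.e. compute a^(2^k)."""
    for _ in range(k):
        a = gf_mul(a, a)
    return a


def itoh_tsujii_inverse(a):
    """Return a^(-1) = a^(2^M - 2) using the addition chain above."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse")
    beta = {1: a}                       # beta[k] = a^(2^k - 1)
    for prev, cur in zip(CHAIN, CHAIN[1:]):
        step = cur - prev               # also a chain element
        beta[cur] = gf_mul(gf_pow2k(beta[prev], step), beta[step])
    return gf_mul(beta[M - 1], beta[M - 1])   # final squaring


a = 0x1F3A5C7E9B2D4F60
a_inv = itoh_tsujii_inverse(a)
check = gf_mul(a, a_inv)    # should be the field's multiplicative identity
```

The chain needs only 10 field multiplications plus the squarings, which is why the hardware pairs one multiplier with a block of cascaded squarers.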

SLIDE 20

Using Higher Exponents in the Itoh-Tsujii Algorithm

Consider using a quad circuit instead of a squarer.

This requires an addition chain to $\frac{m-1}{2}$ instead of $m - 1$, and thus finishes faster.

The frequency of operation is not affected, and the area used is less due to better LUT utilization.

Table: Comparison of Squarer and Quad Circuits on Xilinx Virtex 4 FPGA

Field        Squarer Circuit          Quad Circuit             Size ratio
             #LUTs    Delay (ns)      #LUTq    Delay (ns)      #LUTq / (2 #LUTs)
GF(2^193)    96       1.48            145      1.48            0.75
GF(2^233)    153      1.48            230      1.48            0.75

Larger exponent circuits can similarly be used to obtain faster results.

However, there is an initial overhead of computing $a^{2^q - 1}$, which increases as the exponent circuit increases.

[IEEE TVLSI 2011, DATE 2011]

SLIDE 21

The Arithmetic Unit

[Figure: arithmetic unit, made up of Finite Field Multiplication and a Power Block. A field multiplier (HBKM, #11) and a powerblock of cascaded $2^n$ circuits (Circuit 1 to Circuit $u_s$) sit alongside squarer and quad circuits and multiplexers A to F, selected by A_sel, B_sel, C_sel, D_sel, E_sel, and Pow_sel; inputs A0 to A3, outputs C0 to C2. Blocks are annotated (#1), (#2), or (#11).]

SLIDE 22

The Arithmetic Unit

[Figure: arithmetic unit with the Power Block shown in detail: cascaded $2^n$ circuits (Circuit 1 to Circuit $u_s$) built from ^2 and ^4 blocks, with C_sel, Pow_sel, MUX F, and the C MUX; input A0, output C0]

SLIDE 23

The Arithmetic Unit

[Figure: arithmetic unit showing the field multiplier and the powerblock of cascaded $2^n$ circuits (^2 and ^4 blocks)]

SLIDE 24

The Register Bank

[Figure: register bank (input Qin, output Qout) supplying operands A0 to A3 to the arithmetic unit and receiving results C0 to C2]

Six registers implemented as flip-flops to maximize CLB utilization

SLIDE 28

Pipelining the Processor

How many stages of pipelining?
Too many would increase the frequency of operation but make it difficult to schedule instructions without bubbles
Too few would result in a low frequency of operation
Where to place the pipeline stages?

(a) A Bad Pipeline Stage Placement [figure: register bank and arithmetic unit with the critical path marked]

(b) A Good Pipeline Stage Placement [figure: register bank and arithmetic unit with the critical path marked]

Can we determine the best pipeline strategy at the design exploration phase?

SLIDE 29

Optimally Pipeline into L Stages A Priori

1. Model the delay of the design
2. Use the model to identify all critical paths (with delay $t$)
3. Place pipeline registers to ensure that no path has delay greater than $\frac{t}{L}$

SLIDE 31

Modeling the Delay

The delay of a circuit is proportional to the number of LUTs in the critical path

[Figure: Delay of Circuit (in ns) versus No. of LUTs in Critical Path]

The number of LUTs in the critical path of an $n$-input Boolean function $f(x_1, x_2, x_3, \cdots, x_n)$ is $\lceil \log_k(n) \rceil$, where $k$ is the number of inputs to the LUT.

SLIDE 32

Modeling LUT Delay of all Elements in the Processor

Each entry gives the component, its k-LUT delay for $k \geq 4$, and the value for $m = 163$, $k = 4$.

m-bit field adder
  Delay: 1
  Value: 1

m-bit n:1 Mux ($D_{n:1}(m)$)
  Delay: $\lceil \log_k(n + \log_2 n) \rceil$
  Value: 2 (for n = 4), 1 (for n = 2)

Exponentiation Circuit ($D_{2^n}(m)$)
  Delay: $\max(\mathrm{LUTDelay}(d_i))$, where $d_i$ is the $i$-th output bit of the exponentiation circuit
  Value: 2 (for n = 1), 2 (for n = 2)

Powerblock ($D_{powerblk}(m)$)
  Delay: $u_s \times D_{2^n}(m) + D_{u_s:1}(m)$
  Value: 4 (for $u_s = 2$)

Modular Reduction ($D_{mod}$)
  Delay: 1 for irreducible trinomials, 2 for pentanomials
  Value: 2 (for pentanomials)

HBKM ($D_{HBKM}(m)$)
  Delay: $D_{split} + D_{threshold} + D_{combine} + D_{mod} = \lceil \log_k(\frac{m}{\tau}) \rceil + \lceil \log_k(2\tau) \rceil + \lceil \log_2 \frac{m}{\tau} \rceil + D_{mod}$
  Value: 11 (for $\tau = 11$)
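As a worked check of the formulas in this table, the sketch below evaluates the multiplexer and HBKM delay models for m = 163, k = 4, and tau = 11. The function names are hypothetical, the pairing of each logarithm with D_split, D_threshold, and D_combine simply follows the order in which the terms appear, and using a ceiling for log2(n) when n is not a power of two is an assumption. The ceiling of a logarithm is computed by integer search to avoid floating-point rounding at exact powers.

```python
def ceil_log(x, base):
    """Smallest integer d >= 0 with base**d >= x, i.e. ceil(log_base(x)) for x >= 1."""
    d = 0
    while base ** d < x:
        d += 1
    return d


def lut_depth(n, k):
    """LUTs in the critical path of an n-input Boolean function on k-input LUTs."""
    return ceil_log(n, k)


def mux_delay(n, k):
    """n:1 multiplexer: n data inputs plus log2(n) select lines per output bit."""
    return ceil_log(n + ceil_log(n, 2), k)


def hbkm_delay(m, tau, k, d_mod):
    """Delay of the hybrid Karatsuba multiplier, term by term."""
    d_split = ceil_log(m / tau, k)        # ceil(log_k(m / tau))
    d_threshold = ceil_log(2 * tau, k)    # ceil(log_k(2 * tau))
    d_combine = ceil_log(m / tau, 2)      # ceil(log_2(m / tau))
    return d_split + d_threshold + d_combine + d_mod


mux4 = mux_delay(4, 4)                            # 4:1 mux on 4-input LUTs
mux2 = mux_delay(2, 4)                            # 2:1 mux on 4-input LUTs
hbkm = hbkm_delay(m=163, tau=11, k=4, d_mod=2)    # pentanomial reduction
```

For these parameters the three logarithmic terms evaluate to 2, 3, and 4, which together with the pentanomial reduction delay of 2 gives the 11 LUTs listed for the HBKM.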

SLIDE 33

Critical Path : Placing Pipeline Registers Optimally

[Figure: register bank (Qin, Qout, registers) and arithmetic unit, each block annotated with (#1), (#2), or (#11); the critical path is marked through the field multiplier (HBKM)]

SLIDE 34

Critical Path : Placing Pipeline Registers Optimally

[Figure: the critical path from the register bank through the 4:1 input mux, the merged squarer and adder, the HBKM (#11), and the 4:1 output mux back to the register bank, with pipeline registers dividing it into segments labeled (#6), (#6), (#5), and (#6)]

4 stage pipeline

Critical path length: 23 LUTs
Time period: $\lceil \frac{23}{4} \rceil = 6$ LUTs
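The stage budget is a ceiling division, and a candidate register placement can be checked against it. In this sketch the per-stage delays [6, 6, 5, 6] are read off the figure labels, which is an assumption about what those labels mean, and the function names are hypothetical.

```python
def stage_budget(critical_path_luts, stages):
    """Largest LUT delay allowed per stage: ceil(t / L), in integer arithmetic."""
    return -(-critical_path_luts // stages)


def placement_is_valid(stage_delays, critical_path_luts):
    """True if the stages cover the whole path and none exceeds the budget."""
    budget = stage_budget(critical_path_luts, len(stage_delays))
    covers_path = sum(stage_delays) == critical_path_luts
    return covers_path and max(stage_delays) <= budget


budget = stage_budget(23, 4)                         # 23-LUT path, 4 stages
good = placement_is_valid([6, 6, 5, 6], 23)          # balanced placement
bad = placement_is_valid([11, 4, 4, 4], 23)          # one stage far too long
```

An unbalanced placement such as the second one still has four stages, but its longest stage sets the clock period, so the extra registers buy little.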

SLIDE 36

Scheduling of Operations (for 1 bit)

Table: Scheduling Instructions

$e^k_1$: $X_i \leftarrow X_i \cdot Z_j$
$e^k_2$: $Z_i \leftarrow X_j \cdot Z_i$
$e^k_3$: $T \leftarrow X_j$; $X_j \leftarrow X_j^4 + b \cdot Z_j^4$
$e^k_4$: $Z_j \leftarrow (T \cdot Z_j)^2$
$e^k_5$: $T \leftarrow X_i \cdot Z_i$; $Z_i \leftarrow (X_i + Z_i)^2$
$e^k_6$: $X_i \leftarrow x \cdot Z_i + T$

(a) Data Dependencies Between Instructions [figure: dependency graph over $e^k_1$ to $e^k_6$]

(b) Timing Diagram for 3 Stage Pipeline [figure: $e^k_1$ to $e^k_6$ issued over clock cycles 1 to 8, with $e^k_1$ and $e^k_2$ in either order]
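One way to see where such dependencies come from is to derive the read-after-write relations directly from the instruction table. The sketch below is an illustration, not the authors' scheduler: the register read and write sets are transcribed from the six instructions, constants such as b and x are ignored, and only read-after-write dependencies are tracked, so the graph on the slide may contain edges this does not produce.

```python
# (name, registers written, registers read), in the order e1 ... e6.
INSTRUCTIONS = [
    ("e1", {"Xi"}, {"Xi", "Zj"}),        # Xi <- Xi * Zj
    ("e2", {"Zi"}, {"Xj", "Zi"}),        # Zi <- Xj * Zi
    ("e3", {"T", "Xj"}, {"Xj", "Zj"}),   # T <- Xj ; Xj <- Xj^4 + b * Zj^4
    ("e4", {"Zj"}, {"T", "Zj"}),         # Zj <- (T * Zj)^2
    ("e5", {"T", "Zi"}, {"Xi", "Zi"}),   # T <- Xi * Zi ; Zi <- (Xi + Zi)^2
    ("e6", {"Xi"}, {"Zi", "T"}),         # Xi <- x * Zi + T
]


def raw_dependencies(instructions):
    """Map each instruction to the earlier instructions whose results it reads."""
    last_writer = {}
    deps = {}
    for name, writes, reads in instructions:
        deps[name] = {last_writer[r] for r in reads if r in last_writer}
        for reg in writes:
            last_writer[reg] = name
    return deps


deps = raw_dependencies(INSTRUCTIONS)
```

Under these assumptions e1, e2, and e3 depend on nothing within the same bit, e4 waits for e3, e5 waits for e1 and e2, and e6 waits for e5, which is consistent with e1 and e2 being interchangeable in the timing diagram.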

SLIDE 37

Scheduling of 2 Consecutive bits : Overlapping two bits

When two consecutive bits are equal

[Figure: dependency graphs for the instructions of bit $s_k$ ($e^k_1$ to $e^k_6$) and bit $s_{k-1}$ ($e^{k-1}_1$ to $e^{k-1}_6$), and a timing diagram over clock cycles 1 to 15 in which the instructions of $s_{k-1}$ follow those of $s_k$ with one stolen clock cycle; markers show where $s_k$ completes and where $s_{k-1}$ and $s_{k-2}$ start]

SLIDE 38

Scheduling of 2 Consecutive bits : Overlapping two bits

When two consecutive bits are NOT equal

[Figure: dependency graphs for the instructions of bit $s_k$ ($e^k_1$ to $e^k_6$) and bit $s_{k-1}$ ($e^{k-1}_1$ to $e^{k-1}_6$), and a timing diagram over clock cycles 1 to 15 in which the instructions of $s_{k-1}$ follow those of $s_k$ with one stolen clock cycle; markers show where $s_k$ completes and where $s_{k-1}$ and $s_{k-2}$ start]

SLIDE 39

Clock Cycles for L Stage Pipeline

Without overlapping, $2L + 2$ clock cycles are required per bit
With overlapping, we save a clock cycle, so $2L + 1$
Additionally,
If there is a pipeline stage soon after the multiplier, then data forwarding is feasible
Clock cycles per bit is then $2L$

[Figure: timing diagram over clock cycles 1 to 10 for $e^k_1$ to $e^k_6$, showing the data forwarding path and the point where $s_{k-1}$ starts]

Clock cycles saved are $m$ or $2m$
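The per-bit cycle counts can be written as a small function. This is a sketch with hypothetical names; it treats data forwarding as an addition to overlapping, and `ladder_cycles` approximates the ladder as m identical iterations, ignoring initialization and the final coordinate conversion.

```python
def cycles_per_bit(L, overlap=True, data_forwarding=False):
    """Clock cycles per scalar bit for an L-stage pipeline."""
    if not overlap:
        return 2 * L + 2          # no overlapping between consecutive bits
    if data_forwarding:
        return 2 * L              # overlapping plus data forwarding
    return 2 * L + 1              # overlapping only


def ladder_cycles(m, L, overlap=True, data_forwarding=False):
    """Approximate cycles for the ladder loop over an m-bit scalar."""
    return m * cycles_per_bit(L, overlap, data_forwarding)


baseline = ladder_cycles(163, 4, overlap=False)
saved_overlap = baseline - ladder_cycles(163, 4)
saved_forwarding = baseline - ladder_cycles(163, 4, data_forwarding=True)
```

For m = 163 the two savings come out as 163 and 326 cycles, which is the "m or 2m" stated on the slide; with L = 3 and no overlapping the count per bit is 8.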

SLIDE 40

Modeling Computation Time to find the Right Pipeline

So, we now know

How to place pipeline registers
and estimate the clock cycles required

Putting it together, we can estimate how long it takes to perform a scalar multiplication

Table: Computation Time Estimates for Various Values of L for an ECM over GF(2^163) and FPGA with 4-input LUTs

L    us    Data Forwarding Feasible    Computation Time
1    9     No                          1019
2    4     No                          524
3    3     No                          412
4    2     Yes                         357
5    1     No                          395
6    1     Yes                         360
7    1     Yes                         358

SLIDE 41

Architecture Details

[Figure: detailed architecture. The control unit (inputs Clock, Reset, Scalar) drives the multiplexer select lines (A_sel, B_sel, C_sel, D_sel, E_sel, Pow_sel, A0_sel to A3_sel, Qin_sel, X_sel, Z_sel) and the register write enables (X1_w, X2_w, Z1_w, Z2_w, T_w, A_w). The register bank holds X1, X2, Z1, Z2, T, and A, with the base point and the curve constant b held in ROM. The arithmetic unit contains the powerblock (Circuit 1 to Circuit $u_s$), squarer and quad circuits, multiplexers A to K, a data forwarding path, and the HBKM (split operands, threshold multipliers, combine levels 1 to 4, modular reduction), divided into STAGE 1 to STAGE 4.]

SLIDE 42

Finite State Machine

[Figure: finite state machine with phases Initialization (states I0 to I5), the ladder loop (states AD1, AD2, AD3, AD4, AD9 on the path marked eq and AD1', AD2', AD3', AD4', AD9' on the path marked neq), Coordinate Conversion (states C1 to C125), and Completion. Transitions are labeled with conditions on $s_{t-2}$, on $s_{k-1}$, and on whether $s_{k-1}$ equals $s_k$.]

SLIDE 43

Comparisons

Work                         Platform     Field (m)   Slices          LUTs     Freq (MHz)   Comp. Time (µs)
Orlando                      XCV400E      163         -               3002     76.7         210
Bednara                      XCV1000      191         -               48300    36           270
Gura                         XCV2000E     163         -               19508    66.5         140
Lutz                         XCV2000E     163         -               10017    66           233
Saqib                        XCV3200      191         18314           -        10           56
Pu                           XC2V1000     193         -               3601     115          167
Ansari                       XC2V2000     163         -               8300     100          42
Rebeiro                      XC4V140      233         19674           37073    60           31
Järvinen (1)                 Stratix II   163         (11800 ALMs)    -        -            48.9
Kim (2)                      XC4VLX80     163         24363           -        143          10.1
Chelton                      XCV2600E     163         15368           26390    91           33
                             XC4V200      163         16209           26364    153.9        19.5
Azarderakhsh                 XC4CLX100    163         12834           22815    196          17.2
                             XC5VLX110    163         6536            17305    262          12.9
Our Result (Virtex 4 FPGA)   XC4VLX80     163         8070            14265    147          9.7
                             XC4V200      163         8095            14507    132          10.7
                             XC4VLX100    233         13620           23147    154          12.5
Our Result (Virtex 5 FPGA)   XC5VLX85t    163         3446            10176    167          8.6
                             XC5VSX240    163         3513            10195    148          9.5
                             XC5VLX85t    233         5644            18097    156          12.3

1. uses 4 field multipliers; 2. uses 3 field multipliers; 3. uses 2 field multipliers

SLIDE 44

Conclusions

We show the implementation of a high-speed elliptic curve crypto-processor for FPGA platforms

The use of highly optimized finite field primitives and efficient utilization of FPGA primitives help reduce the area, which in turn makes routing easier

A theoretically designed pipeline strategy provides ideal pipelining of the design to increase clock frequency

Efficient scheduling with data forwarding mechanisms reduces the clock cycle requirement

All these result in one of the fastest elliptic curve implementations on FPGAs

Additionally, the area required is significantly less compared to other high-speed designs

SLIDE 45

Thank You for your Attention
