Pushing the Limits of High-Speed GF(2^m) Elliptic Curve Scalar Multiplier on FPGAs
Chester Rebeiro, Sujoy Sinha Roy, and Debdeep Mukhopadhyay
Secured Embedded Architecture Lab, Indian Institute of Technology Kharagpur, India
9/12/2012 CHES
Elliptic Curve Scalar Multiplication
- An elliptic curve over GF(2^m) is the set of points that satisfy the equation $y^2 + xy = x^3 + ax^2 + b$, where $a, b \in \mathrm{GF}(2^m)$ and $b \neq 0$. The points on the elliptic curve form an additive group.
- The projective coordinate representation of the curve is
$Y^2 + XYZ = X^3 Z + aX^2 Z^2 + bZ^4$
- Scalar Multiplication:
Given a base point $P = (X_P, Y_P, Z_P)$ on the elliptic curve and a scalar $s$, compute $Q = sP$
(i.e. $Q = P + P + P + \cdots$ ($s$ times))
9/12/2012 CHES 2012, Leuven Belgium 2
Montgomery Ladder for Scalar Multiplication
Inputs: scalar $s = (s_{t-1} s_{t-2} \cdots s_1 s_0)_2$, base point $P$
Output: scalar product $Q = sP$
1 $P_1 = (X_1, Y_1, Z_1) \leftarrow P$ and $P_2 = (X_2, Y_2, Z_2) \leftarrow 2 \cdot P$
2 For each bit $s_k$ (for $k = t-2, t-3, \cdots, 0$)
- if $s_k = 1$ then $P_1 \leftarrow P_1 + P_2$; $P_2 \leftarrow 2 \cdot P_2$
- if $s_k = 0$ then $P_2 \leftarrow P_1 + P_2$; $P_1 \leftarrow 2 \cdot P_1$
3 $Q = \mathrm{Projective2Affine}(P_1)$
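The control flow of the ladder can be sketched independently of the curve arithmetic. The minimal Python sketch below takes the group operations as parameters; the names `montgomery_ladder`, `add` and `double` are illustrative and not from the slides. It is exercised on the integers under addition, a toy group in which sP is ordinary multiplication.

```python
def montgomery_ladder(s, P, add, double):
    """Return s*P with the Montgomery ladder; requires s >= 1.

    `add` and `double` are caller-supplied group operations (assumed helpers).
    After every iteration the invariant P2 - P1 == P holds.
    """
    if s < 1:
        raise ValueError("scalar must be a positive integer")
    P1, P2 = P, double(P)                         # step 1
    for k in range(s.bit_length() - 2, -1, -1):   # bits s_{t-2} down to s_0
        if (s >> k) & 1:
            P1, P2 = add(P1, P2), double(P2)
        else:
            # the right-hand side is evaluated with the old P1 and P2
            P1, P2 = double(P1), add(P1, P2)
    return P1


# Toy group for illustration only: integers under addition.
int_add = lambda a, b: a + b
int_double = lambda a: 2 * a
print(montgomery_ladder(0b1011, 5, int_add, int_double))
```

Both branches perform one addition and one doubling, so every scalar bit costs the same sequence of operations.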
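Performing $P_i \leftarrow P_i + P_j$ and $P_j \leftarrow 2 \cdot P_j$:
$X_i \leftarrow X_i \cdot Z_j$ ; $Z_i \leftarrow X_j \cdot Z_i$ ; $T \leftarrow X_j$ ; $X_j \leftarrow X_j^4 + b \cdot Z_j^4$
$Z_j \leftarrow (T \cdot Z_j)^2$ ; $T \leftarrow X_i \cdot Z_i$ ; $Z_i \leftarrow (X_i + Z_i)^2$ ; $X_i \leftarrow x \cdot Z_i + T$
. . . all operations are in GF(2^m)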
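The combined add-and-double step can be checked on a toy field. The sketch below is only an illustration under stated assumptions: it uses GF(2^7) with the polynomial x^7 + x + 1 and the curve constants a = b = 1, none of which come from the slides, and every function name is hypothetical. `ladder_step` follows the eight assignments above, and `affine_add` is a textbook affine reference used only as a cross-check.

```python
M = 7                  # toy field GF(2^7), far too small for real use
POLY = 0b10000011      # x^7 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Field multiplication in polynomial basis (shift-and-add)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def gf_sqr(a):
    return gf_mul(a, a)

def gf_inv(a):
    """a^(2^M - 2), the inverse by Fermat's little theorem."""
    r = 1
    for _ in range(M - 1):
        a = gf_sqr(a)
        r = gf_mul(r, a)
    return r

def ladder_step(Xi, Zi, Xj, Zj, x, b):
    """Pi <- Pi + Pj and Pj <- 2*Pj, where x is the affine x of the base point."""
    Xi = gf_mul(Xi, Zj)
    Zi = gf_mul(Xj, Zi)
    T = Xj
    Xj = gf_sqr(gf_sqr(Xj)) ^ gf_mul(b, gf_sqr(gf_sqr(Zj)))
    Zj = gf_sqr(gf_mul(T, Zj))
    T = gf_mul(Xi, Zi)
    Zi = gf_sqr(Xi ^ Zi)
    Xi = gf_mul(x, Zi) ^ T
    return Xi, Zi, Xj, Zj

def ladder_x(s, x, b):
    """Projective (X : Z) of s*P from the affine x-coordinate of P, s >= 1."""
    X1, Z1 = x, 1
    X2, Z2 = gf_sqr(gf_sqr(x)) ^ b, gf_sqr(x)      # 2P
    for k in range(s.bit_length() - 2, -1, -1):
        if (s >> k) & 1:
            X1, Z1, X2, Z2 = ladder_step(X1, Z1, X2, Z2, x, b)
        else:
            X2, Z2, X1, Z1 = ladder_step(X2, Z2, X1, Z1, x, b)
    return X1, Z1

def affine_add(P, Q, a):
    """Reference affine addition on y^2 + xy = x^3 + a*x^2 + b (None = infinity)."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 != y2 or x1 == 0:
            return None                              # Q = -P
        lam = x1 ^ gf_mul(y1, gf_inv(x1))            # doubling
        x3 = gf_sqr(lam) ^ lam ^ a
        return x3, gf_sqr(x1) ^ gf_mul(lam ^ 1, x3)
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
    x3 = gf_sqr(lam) ^ lam ^ x1 ^ x2 ^ a
    return x3, gf_mul(lam, x1 ^ x3) ^ x3 ^ y1

def find_point(a, b):
    """Brute-force search for a curve point with x != 0 (toy field only)."""
    for x in range(1, 1 << M):
        for y in range(1 << M):
            if gf_sqr(y) ^ gf_mul(x, y) == gf_mul(gf_sqr(x), x ^ a) ^ b:
                return x, y
    raise ValueError("no point found")
```

The ladder only tracks X and Z, which is why the y-coordinate never appears in `ladder_step`; the result is compared with the affine reference through X = x(sP) · Z, so no inversion is needed for the check.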
Engineering the Montgomery Ladder for Scalar Multiplication
[Figure (a) The ECC Pyramid: layers for scalar multiplication, elliptic curve group operations, and finite field operations]
[Figure (b) Block Diagram: register bank, arithmetic unit, ROM, and control unit, with input s and output sP]
High-speed scalar multiplication on FPGAs
- Minimize area by maximizing utilization of available resources
- Optimal Pipelining
- Efficient Scheduling of Operations
Field Programmable Gate Arrays
- Provides the speed of hardware and the reconfigurability of software
- FPGA Architecture
[Figure (a) FPGA Island: logic block, programmable connection switches, programmable routing switches]
[Figure (b) Lookup Table: LUT with carry logic and control; inputs F1 to F4, carry signals CIN and COUT, and a flip-flop with CLK, CE, SR, BY, PRE, CLR, D and Q pins]
LUT Utilization
LUT
- Four (or six) input → one output
- Can implement any four (or six) input truth table
- y1 = x1 ⊕ x2 ⊕ x3 ⊕ x4
Requires one LUT.
- y2 = x1 ⊕ x2
Still requires one LUT.
- y2 results in an underutilized LUT.
. . . need to maximize LUT utilization to minimize area.
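The point about utilization can be made concrete by treating a LUT as a stored truth table. In the sketch below (helper names `lut_contents` and `inputs_used` are illustrative, not from the slides), both functions occupy one 16-entry table of a 4-input LUT, but the second depends on only two of the four inputs.

```python
from itertools import product

def lut_contents(f, k=4):
    """Truth table that a k-input LUT would store for the function f."""
    return [f(*bits) & 1 for bits in product((0, 1), repeat=k)]

def inputs_used(table, k=4):
    """Number of LUT inputs whose value can actually change the output."""
    used = 0
    for j in range(k):
        flip = 1 << (k - 1 - j)   # table index bit that corresponds to input j
        if any(table[i] != table[i ^ flip] for i in range(1 << k)):
            used += 1
    return used

y1 = lut_contents(lambda x1, x2, x3, x4: x1 ^ x2 ^ x3 ^ x4)
y2 = lut_contents(lambda x1, x2, x3, x4: x1 ^ x2)

print(len(y1), inputs_used(y1))   # table size and inputs used by y1
print(len(y2), inputs_used(y2))   # same table size, fewer inputs used by y2
```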
Finite field Multiplier for Best LUT utilization
[Figure (a) Karatsuba-Ofman Multiplication: a 233-bit Karatsuba multiplier recursively split into 116/117-bit, 58/59-bit, 29/30-bit, 14/15-bit and 7/8-bit multiplications]
[VLSID 2008]
[Figure (b) Hybrid Karatsuba Multiplication: the 233-bit multiplication uses Karatsuba multipliers at the upper levels (116/117, 58/59, 29/30 and 14/15 bits) and classical multipliers for the 7/8-bit products]
[Figure (c) Finding the Right Threshold: number of LUTs (axis marks 8800 and 9600) versus threshold (4 to 22)]
[Figure (d) Comparing Multipliers: area * time versus number of bits (90 to 540) for the Karatsuba-Ofman and hybrid Karatsuba multipliers]
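The hybrid idea, Karatsuba splitting above a threshold and a classical multiplier below it, can be sketched in software on polynomials over GF(2) stored as Python integers. This is a minimal illustration, not the authors' hardware design: the function names are hypothetical, the default threshold of 11 is borrowed from the value of τ quoted later in the deck, and modular reduction is left out.

```python
import random

def clmul_schoolbook(a, b):
    """Classical (shift-and-XOR) multiplication of GF(2) polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def hybrid_karatsuba(a, b, nbits, threshold=11):
    """Multiply two polynomials of at most nbits coefficients each.

    Operands wider than `threshold` are split Karatsuba-style into
    ceil(nbits/2) low bits and floor(nbits/2) high bits; narrower
    operands go to the classical multiplier.
    """
    if nbits <= max(threshold, 1):
        return clmul_schoolbook(a, b)
    half = (nbits + 1) // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    lo = hybrid_karatsuba(a_lo, b_lo, half, threshold)
    hi = hybrid_karatsuba(a_hi, b_hi, nbits - half, threshold)
    mid = hybrid_karatsuba(a_lo ^ a_hi, b_lo ^ b_hi, half, threshold)
    # In GF(2), the cross term a_hi*b_lo + a_lo*b_hi equals mid + lo + hi.
    return lo ^ ((mid ^ lo ^ hi) << half) ^ (hi << (2 * half))

rng = random.Random(2012)
a, b = rng.getrandbits(233), rng.getrandbits(233)
print(hybrid_karatsuba(a, b, 233) == clmul_schoolbook(a, b))
```

With nbits = 233 the first split is into 117 and 116 bits, matching the sizes shown in the figures.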
Finite Field Inversion Using Itoh-Tsujii Algorithm
- Given $a \in \mathrm{GF}(2^m)$, find $a^{-1} \in \mathrm{GF}(2^m)$ such that $a \cdot a^{-1} = 1$
- Fermat's Little Theorem: $a^{-1} = a^{2^m - 2}$
- Itoh-Tsujii Algorithm
1 Define the addition chain for m − 1
(for example m = 233 : (1, 2, 3, 6, 7, 14, 28, 29, 58, 116, 232))
2 Compute
$a \to a^{2^2-1} \to a^{2^3-1} \to a^{2^6-1} \to a^{2^7-1} \to a^{2^{14}-1} \to \cdots \to a^{2^{232}-1}$
3 Square to get $a^{2^{233}-2}$
- Exponentiation requires a series of cascaded squarers, called a powerblock, along with a finite field multiplier
[Figure: powerblock built from cascaded squarer circuits (Circuit-1, Circuit-2, Circuit-3, ..., Circuit-11) between input and output, with a multiplexer controlled by qsel selecting the output]
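The three steps of the Itoh-Tsujii algorithm can be sketched in software for m = 233. The sketch assumes the NIST field polynomial x^233 + x^74 + 1 and a chain that includes 29, so that every element is the sum of two earlier ones; both are assumptions of this example, and the function names are illustrative. It relies on the identity $\beta_{i+j} = \beta_i^{2^j} \cdot \beta_j$ with $\beta_u = a^{2^u - 1}$.

```python
M = 233
POLY = (1 << 233) | (1 << 74) | 1   # x^233 + x^74 + 1 (assumed field polynomial)

def gf_mul(a, b):
    """Carry-less multiply followed by reduction modulo POLY."""
    c = 0
    while b:
        if b & 1:
            c ^= a
        a <<= 1
        b >>= 1
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            c ^= POLY << (i - M)
    return c

def gf_pow2k(a, k):
    """a^(2^k): k cascaded squarings, the job of the powerblock."""
    for _ in range(k):
        a = gf_mul(a, a)
    return a

CHAIN_232 = (1, 2, 3, 6, 7, 14, 28, 29, 58, 116, 232)

def itoh_tsujii_inverse(a, chain=CHAIN_232):
    """Inverse of a nonzero element; chain must end at M - 1."""
    beta = {1: a}                                    # beta[u] = a^(2^u - 1)
    for prev, cur in zip(chain, chain[1:]):
        step = cur - prev                            # must already be in the chain
        beta[cur] = gf_mul(gf_pow2k(beta[prev], step), beta[step])
    return gf_pow2k(beta[chain[-1]], 1)              # final squaring: a^(2^M - 2)

a = 0xDEADBEEF
print(gf_mul(a, itoh_tsujii_inverse(a)) == 1)
```

Each chain step costs one field multiplication plus a run of squarings, which is why the hardware pairs the multiplier with a powerblock.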
Using Higher Exponents in the Itoh-Tsujii Algorithm
Consider using a quad circuit instead of a squarer.
- This requires an addition chain to $\frac{m-1}{2}$ instead of $m - 1$ and thus finishes faster.
- The frequency of operation is not affected and the area used is less due to better LUT utilization.
Table: Comparison of Squarer and Quad Circuits on Xilinx Virtex 4 FPGA
| Field | Squarer circuit #LUT_s | Squarer delay (ns) | Quad circuit #LUT_q | Quad delay (ns) | Size ratio #LUT_q / (2 · #LUT_s) |
|---|---|---|---|---|---|
| GF(2^193) | 96 | 1.48 | 145 | 1.48 | 0.75 |
| GF(2^233) | 153 | 1.48 | 230 | 1.48 | 0.75 |
- Larger exponent circuits can similarly be used to obtain faster results.
- However, there is an initial overhead of computing $a^{2^q-1}$, which increases as the exponent of the circuit increases. [IEEE TVLSI 2011, DATE 2011]
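A rough operation count shows where the saving comes from. The sketch below assumes each chain step from u to v costs one field multiplication and v − u passes through the exponentiation circuit (squarer or quad); the chains and the helper name `chain_cost` are assumptions of this example, with 29 included so that every element is a sum of two earlier ones.

```python
def chain_cost(chain):
    """(multiplications, exponentiation-circuit passes) for an addition chain."""
    mults = passes = 0
    for prev, cur in zip(chain, chain[1:]):
        mults += 1              # one field multiplication per chain step
        passes += cur - prev    # the running value goes through (cur - prev) circuits
    return mults, passes

squarer_chain = (1, 2, 3, 6, 7, 14, 28, 29, 58, 116, 232)   # chain to m - 1 = 232
quad_chain = (1, 2, 3, 6, 7, 14, 28, 29, 58, 116)           # chain to (m - 1)/2 = 116

print(chain_cost(squarer_chain))
print(chain_cost(quad_chain))
```

The quad chain needs about half as many circuit passes, but it starts from $a^{2^2-1} = a^3$ rather than from $a$, which costs one extra squaring and one extra multiplication up front. Both variants end with a final squaring.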
The Arithmetic Unit
[Figure: the arithmetic unit, built from a finite field multiplier (HBKM) and a powerblock of u_s cascaded 2^n circuits, together with squarer and quad circuits, fed by multiplexers MUX A to MUX F (select lines A_sel to E_sel, Pow_sel); inputs A0 to A3, outputs C0 to C2; each component is annotated with its LUT delay (#11 for the HBKM, #1 or #2 for the others)]
The Register Bank
[Figure: the register bank (signals A0 to A3, C0 to C2, Qin and Qout) connected to the arithmetic unit]
Six registers implemented as flip-flops to maximize CLB utilization
Pipelining the Processor
- How many stages of pipelining?
- Too many would increase the frequency of operation but make it difficult to schedule instructions without bubbles
- Too few would result in a low frequency of operation
- Where to place the pipeline stages?
[Figure (a) A Bad Pipeline Stage Placement: pipeline registers relative to the critical path through the register bank, multiplexers and arithmetic unit]
[Figure (b) A Good Pipeline Stage Placement: the same datapath with the registers placed along the critical path]
Can we determine the best pipeline strategy at the design exploration phase?
Optimally Pipeline into L Stages A Priori
1 Model the delay of the design
2 Use the model to identify all critical paths (with delay t)
3 Place pipeline registers to ensure that no path has delay greater than $t/L$
Modeling the Delay
- The delay of a circuit is proportional to the number of LUTs in the critical path
[Figure: delay of circuit (in ns) versus number of LUTs in the critical path]
- The number of LUTs in the critical path of an n input Boolean function $f(x_1, x_2, x_3, \cdots, x_n)$ is $\lceil \log_k(n) \rceil$, where k is the number of inputs to the LUT.
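As a worked example of this estimate, the short sketch below evaluates $\lceil \log_k(n) \rceil$ for a few input counts with 4-input and 6-input LUTs. The helper name `lut_depth` is illustrative; integer arithmetic is used so that exact powers of k are not affected by floating-point rounding.

```python
def lut_depth(n, k):
    """ceil(log_k(n)): LUT levels needed to combine n inputs with k-input LUTs."""
    depth, reach = 0, 1
    while reach < n:      # `reach` inputs can be covered by `depth` levels
        reach *= k
        depth += 1
    return depth

for n in (4, 5, 16, 17, 64):
    print(n, lut_depth(n, 4), lut_depth(n, 6))
```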
Modeling LUT Delay of all Elements in the Processor
| Component | k-LUT delay for k ≥ 4 | m = 163, k = 4 |
|---|---|---|
| m bit field adder | 1 | 1 |
| m bit n : 1 Mux ($D_{n:1}(m)$) | $\lceil \log_k(n + \log_2 n) \rceil$ | 2 (for n = 4), 1 (for n = 2) |
| Exponentiation Circuit ($D_{2^n}(m)$) | $\max(\mathrm{LUTDelay}(d_i))$, where $d_i$ is the $i$th output bit of the exponentiation circuit | 2 (for n = 1), 2 (for n = 2) |
| Powerblock ($D_{powerblk}(m)$) | $u_s \times D_{2^n}(m) + D_{u_s:1}(m)$ | 4 (for $u_s$ = 2) |
| Modular Reduction ($D_{mod}$) | 1 for irreducible trinomials, 2 for pentanomials | 2 (for pentanomials) |
| HBKM ($D_{HBKM}(m)$) | $D_{split} + D_{threshold} + D_{combine} + D_{mod} = \lceil \log_k(m/\tau) \rceil + \lceil \log_k(2\tau) \rceil + \lceil \log_2(m/\tau) \rceil + D_{mod}$ | 11 (for τ = 11) |
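The multiplexer and HBKM rows can be checked numerically. The sketch below evaluates the two formulas for m = 163, k = 4 and τ = 11 with a pentanomial reduction delay of 2; the function names are illustrative, and the ceilings are computed with integers so that the ratio m/τ is handled exactly.

```python
def ceil_log(base, num, den=1):
    """Smallest d with base**d >= num/den, using integer arithmetic only."""
    d = 0
    while den * base ** d < num:
        d += 1
    return d

def mux_delay(n, k):
    """ceil(log_k(n + log2 n)) for an n:1 multiplexer, n a power of two."""
    return ceil_log(k, n + (n.bit_length() - 1))

def hbkm_delay(m, tau, k, d_mod):
    """D_split + D_threshold + D_combine + D_mod for the HBKM."""
    d_split = ceil_log(k, m, tau)        # ceil(log_k(m / tau))
    d_threshold = ceil_log(k, 2 * tau)   # ceil(log_k(2 * tau))
    d_combine = ceil_log(2, m, tau)      # ceil(log_2(m / tau))
    return d_split + d_threshold + d_combine + d_mod

print(mux_delay(4, 4), mux_delay(2, 4))
print(hbkm_delay(163, 11, 4, d_mod=2))
```

For these parameters the terms are 2 + 3 + 4 + 2, which agrees with the 11 LUT delays listed for the HBKM.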
Critical Path : Placing Pipeline Registers Optimally
[Figure: the critical path through the register bank and the arithmetic unit (multiplexers, field multiplier (HBKM), powerblock and squarer), with each component annotated with its LUT delay (#11 for the HBKM, #1 or #2 for the others)]
[Figure: the critical path from the register bank through the 4:1 input mux, MUX B, the HBKM (#11), the merged squarer and adder, MUX D and the 4:1 output mux back to the register bank, cut by pipeline registers into segments of (#6), (#6), (#5) and (#6) LUT delays]
4 stage pipeline
- Critical path length : 23 LUTs
- Time Period: $\lceil 23/4 \rceil = 6$ LUTs
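The same arithmetic for other stage counts is shown below. This is only an illustration: the 23-LUT path belongs to the 4 stage design, and a different L would normally change the datapath (for instance the powerblock size), so the list is not a prediction for those designs. The helper name `stage_delay` is an assumption of this example.

```python
def stage_delay(critical_path_luts, stages):
    """ceil(t / L) in LUT delays, using integer arithmetic."""
    return -(-critical_path_luts // stages)

print(stage_delay(23, 4))
print([stage_delay(23, L) for L in range(1, 8)])
```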
Scheduling of Operations (for 1 bit)
Table: Scheduling Instructions
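$e^k_1$ : $X_i \leftarrow X_i \cdot Z_j$
$e^k_2$ : $Z_i \leftarrow X_j \cdot Z_i$
$e^k_3$ : $T \leftarrow X_j$ ; $X_j \leftarrow X_j^4 + b \cdot Z_j^4$
$e^k_4$ : $Z_j \leftarrow (T \cdot Z_j)^2$
$e^k_5$ : $T \leftarrow X_i \cdot Z_i$ ; $Z_i \leftarrow (X_i + Z_i)^2$
$e^k_6$ : $X_i \leftarrow x \cdot Z_i + T$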
[Figure (a) Data Dependencies Between Instructions: dependency graph over $e^k_1, \ldots, e^k_6$]
[Figure (b) Timing Diagram for 3 Stage Pipeline: clock cycles 1 to 8, with the instructions issued as $e^k_1$ or $e^k_2$, $e^k_2$ or $e^k_1$, $e^k_3$, $e^k_4$, $e^k_5$, $e^k_6$]
Scheduling of 2 Consecutive bits : Overlapping two bits
When two consecutive bits are equal
[Figure: dependency graph between the instructions $e^k_1, \ldots, e^k_6$ of bit $s_k$ and $e^{k-1}_1, \ldots, e^{k-1}_6$ of bit $s_{k-1}$, and a timing diagram over clock cycles 1 to 15 marking where $s_k$ completes, where $s_{k-1}$ and $s_{k-2}$ start, and the stolen clock cycle]
When two consecutive bits are NOT equal
[Figure: dependency graph between the instructions $e^k_1, \ldots, e^k_6$ of bit $s_k$ and $e^{k-1}_1, \ldots, e^{k-1}_6$ of bit $s_{k-1}$, and a timing diagram over clock cycles 1 to 15 marking where $s_k$ completes, where $s_{k-1}$ and $s_{k-2}$ start, and the stolen clock cycle]
Clock Cycles for L Stage Pipeline
- Without overlapping, 2L + 2 clock cycles are required per bit
- With overlapping, we save a clock cycle, so 2L + 1
- Additionally,
- If there is a pipeline stage soon after the multiplier, then data forwarding is feasible
- Clock cycles per bit is then 2L
[Figure: timing diagram over clock cycles 1 to 10 for instructions $e^k_1$ to $e^k_6$, showing the data forwarding path and the point where $s_{k-1}$ starts]
Over an m bit scalar, the clock cycles saved are m (with overlapping) or 2m (with overlapping and data forwarding)
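The per-bit counts translate into a rough cycle estimate for the ladder. The sketch below is a simplification under stated assumptions: it multiplies the per-bit count by m and ignores initialization, the final coordinate conversion, and the fact that the ladder loop runs over t − 1 bits. The names `cycles_per_bit` and `ladder_cycles` are illustrative.

```python
# Clock cycles saved per bit by each scheduling option
SAVINGS = {"none": 0, "overlap": 1, "overlap+forwarding": 2}

def cycles_per_bit(L, mode="none"):
    """2L + 2 without overlapping, 2L + 1 with it, 2L with data forwarding."""
    return 2 * L + 2 - SAVINGS[mode]

def ladder_cycles(m, L, mode="none"):
    """Approximate ladder cycles for an m bit scalar on an L stage pipeline."""
    return m * cycles_per_bit(L, mode)

for mode in SAVINGS:
    print(mode, cycles_per_bit(4, mode), ladder_cycles(163, 4, mode))
```

For m = 163 and L = 4 the difference between the first and last rows is 2m = 326 cycles.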
Modeling Computation Time to find the Right Pipeline
So, we now know
- How to place pipeline registers
- and estimate the clock cycles required
Putting it together, we can estimate how long it takes to perform a scalar multiplication
Table: Computation Time Estimates for Various Values of L for an ECM over GF(2^163) and FPGA with 4 input LUTs
| L | us | Data Forwarding Feasible | Computation Time |
|---|---|---|---|
| 1 | 9 | No | 1019 |
| 2 | 4 | No | 524 |
| 3 | 3 | No | 412 |
| 4 | 2 | Yes | 357 |
| 5 | 1 | No | 395 |
| 6 | 1 | Yes | 360 |
| 7 | 1 | Yes | 358 |
Architecture Details
[Figure: detailed architecture. A control unit (inputs: clock, reset, scalar) drives the multiplexer select lines and the register write enables (X1_w, X2_w, Z1_w, Z2_w, T_w, A_w); a ROM holds the base point and the curve constant b; the register bank (X1, X2, Z1, Z2, T, A) feeds the arithmetic unit (HBKM, powerblock, squarer and quad circuits). The HBKM is divided into split operands, threshold multipliers, combine levels 1 to 4, combine outputs and modular reduction, pipelined as stages 1 to 4, with a data forwarding path]
Finite State Machine
[Figure: finite state machine with initialization states (I0 to I5), add-and-double states (AD1 to AD9, in eq and neq variants selected by comparing $s_{k-1}$ with $s_k$), coordinate conversion states (C1 to C125), and completion; the first transition depends on $s_{t-2}$]
Comparisons
| Work | Platform | Field (m) | Slices | LUTs | Freq (MHz) | Comp. Time (µs) |
|---|---|---|---|---|---|---|
| Orlando | XCV400E | 163 | - | 3002 | 76.7 | 210 |
| Bednara | XCV1000 | 191 | - | 48300 | 36 | 270 |
| Gura | XCV2000E | 163 | - | 19508 | 66.5 | 140 |
| Lutz | XCV2000E | 163 | - | 10017 | 66 | 233 |
| Saqib | XCV3200 | 191 | 18314 | - | 10 | 56 |
| Pu | XC2V1000 | 193 | - | 3601 | 115 | 167 |
| Ansari | XC2V2000 | 163 | - | 8300 | 100 | 42 |
| Rebeiro | XC4V140 | 233 | 19674 | 37073 | 60 | 31 |
| Järvinen¹ | Stratix II | 163 | (11800 ALMs) | - | | 48.9 |
| Kim² | XC4VLX80 | 163 | 24363 | - | 143 | 10.1 |
| Chelton | XCV2600E | 163 | 15368 | 26390 | 91 | 33 |
| | XC4V200 | 163 | 16209 | 26364 | 153.9 | 19.5 |
| Azarderakhsh | XC4CLX100 | 163 | 12834 | 22815 | 196 | 17.2 |
| | XC5VLX110 | 163 | 6536 | 17305 | 262 | 12.9 |
| Our Result (Virtex 4 FPGA) | XC4VLX80 | 163 | 8070 | 14265 | 147 | 9.7 |
| | XC4V200 | 163 | 8095 | 14507 | 132 | 10.7 |
| | XC4VLX100 | 233 | 13620 | 23147 | 154 | 12.5 |
| Our Result (Virtex 5 FPGA) | XC5VLX85t | 163 | 3446 | 10176 | 167 | 8.6 |
| | XC5VSX240 | 163 | 3513 | 10195 | 148 | 9.5 |
| | XC5VLX85t | 233 | 5644 | 18097 | 156 | 12.3 |

1. uses 4 field multipliers; 2. uses 3 field multipliers; 3. uses 2 field multipliers
Conclusions
- We show the implementation of a high-speed elliptic curve crypto-processor for FPGA platforms
- The use of highly optimized finite field primitives and efficient utilization of FPGA primitives helps reduce the area, which in turn makes routing easier
- A theoretically designed pipeline strategy provides ideal pipelining of the design to increase clock frequency
- Efficient scheduling with data forwarding mechanisms reduces the clock cycle requirement
- All these result in one of the fastest elliptic curve implementations on FPGAs
- Additionally, the area required is significantly less compared to other high-speed designs
Thank You for your Attention