I need to calculate the speed difference between performing a Montgomery Multiplication page 602-603 with a word-size/register of size 32 vs. 64.
So far, this is what I understand:
How can I put all this together to represent the number of clock cycles needed in Montgomery multiplication on a computer with a 32-bit register or 64-bit register?
The number of cycles for a multiple-precision Montgomery multiplication would indeed be n(2+2*n)
if all the intermediate single-precision multiplication operands and results were available in registers. For cryptographic operations this is hardly possible since m
is usually 1024 or larger. Assuming 32-bit registers (xyR^-1 mod m) would require 192 registers only to store the operands (3*(1024/32)). In fact you need to take into account memory accesses to answer this question.
A rewrite of the algorithm with memory accesses (assuming multiplications can be done in parallel with loads/stores):
i
from 0 to n: a_i <- 0
i
from 0 to (n − 1) do the following:
a_0
y_0
x_i
u_i <- (a_0 + x_i*y_0)m' mod b
. Store u_i
in a registerc = 0
(Computing A <- (A + x_i*y + u_i*m)/b
)j
from 0 to (n-1):
a_j
y_j
(cv) = a_j + x_i*y_j + c
, Fetch m_j
(cv) = (cv) + u_i*m_j
, if j>0 Store a_{j-1} <- v
a_n <- c
and a_{n-1} <- v
A >= m
then A <- A − m
.Hope this helps.