RSA hardware implementation: radix-2 montgomery multiplication issues

I'm implementing RSA 1024 in hardware (xilinx ZYNQ FPGA), and am unable to figure out a few curious issues. Most notably, I am finding that my implementation only works for certain base/exponent/modulus combinations, but have not found any reason why this is the case.

Note: I am implementing the algorithm using Xilinx HLS (essentially C code that is synthesized into hardware). For the sake of this post, treat it just like a standard C implementation, except that I can have variables up to 4096 bits wide. I haven't yet parallelized it, so it should behave just like standard C code.

The Problem

My problem is that I am able to get the correct answer for certain modular exponentiation test problems, but only if the values for the base, exponent, and modulus can be written in much fewer bits than the actual 1024 bit operand width (i.e. they are zero padded).

When I use actual 1024-bit values generated from SSH-keygen, I no longer get the correct results.

For example, if my input arguments are

uint1024_t base     = 1570
uint1024_t exponent = 1019
uint1024_t modulus  = 3337

I correctly get a result of 1570^1029 mod(3337) = 688

However, when I actually use values that occupy all (or approximately all) 1024 bits for the inputs...

uint1024_t base     = 0x00be5416af9696937b7234421f7256f78dba8001c80a5fdecdb4ed761f2b7f955946ec920399f23ce9627f66286239d3f20e7a46df185946c6c8482e227b9ce172dd518202381706ed0f91b53c5436f233dec27e8cb46c4478f0398d2c254021a7c21596b30f77e9886e2fd2a081cadd3faf83c86bfdd6e9daad12559f8d2747
uint1024_t exponent = 0x6f1e6ab386677cdc86a18f24f42073b328847724fbbd293eee9cdec29ac4dfe953a4256d7e6b9abee426db3b4ddc367a9fcf68ff168a7000d3a7fa8b9d9064ef4f271865045925660fab620fad0aeb58f946e33bdff6968f4c29ac62bd08cf53cb8be2116f2c339465a64fd02517f2bafca72c9f3ca5bbf96b24c1345eb936d1
uint1024_t modulus  = 0xb4d92132b03210f62e52129ae31ef25e03c2dd734a7235efd36bad80c28885f3a9ee1ab626c30072bb3fd9906bf89a259ffd9d5fd75f87a30d75178b9579b257b5dca13ca7546866ad9f2db0072d59335fb128b7295412dd5c43df2c4f2d2f9c1d59d2bb444e6dac1d9cef27190a97aae7030c5c004c5aea3cf99afe89b86d6d

I incorrectly get a massive number, rather than the correct answer of 29 (0x1D)

I've checked both algorithms a million times over, and have experimented with different initial values and loop bounds, but nothing seems to work.

My Implementation

I am using the standard square and multiply method for the modular exponentiation, and I chose to use the Tenca-Koc radix-2 algorithm for the montgomery multiplication, detailed in pseudocode below...

/* Tenca-Koc radix2 montgomery multiplication */
Z = 0
for i = 0 to n-1
    Z = Z + X[i]*Y
    if Z is odd then Z = Z + M
    Z = Z/2  // left shift in radix2
if (S >= M) then S = S - M

My Montgomery multiplication implementation is as follows:

void montMult(uint1024_t X, uint1024_t Y, uint1024_t M, uint1024_t* outData)
{
    ap_uint<2*NUM_BITS> S = 0; 

    for (int i=0; i<NUM_BITS; i++)
    {
        // add product of X.get_bit(i) and Y to partial sum
        S += X[i]*Y; 

        // if S is even, add modulus to partial sum
        if (S.test(0))  
            S += M;     

        // rightshift 1 bit (divide by 2)
        S = S >> 1;
    }

    // bring back to under 1024 bits by subtracting modulus
    if (S >= M)
        S -= M;

    // write output data
    *outData = S.range(NUM_BITS-1,0);

}

and my top-level modular exponentiation is as follows, where (switching notation!) ...

// k: number of bits
// r = 2^k (radix)
// M: base
// e: exponent
// n: modulus
// Mbar: (precomputed residue) M*r mod(n)
// xbar: (precomputed initial residue) 1*r mod(n)

void ModExp(uint1024_t M, uint1024_t e, uint1024_t n, 
            uint1024_t Mbar, uint1024_t xbar, uint1024_t* out)
{
    for (int i=NUM_BITS-1; i>=0; i--)
    {
        // square
        montMult(xbar,xbar,n,&xbar);

        // multiply   
        if (e.test(i)) // if (e.bit(i) == 1)
            montMult(Mbar,xbar,n,&xbar);
    }
        // undo montgomery residue transformation
        montMult(xbar,1,n,out);
}

I can't for the life of me figure out why this works for everything except an actual 1024 bit value. Any help would be much appreciated

Solution

Update: I finally was able to fix the issue, after I ported my design to Java to check my intermediate values in the debugger. The design ran flawlessly in Java with no modifications to the code structure, and this tipped me off as to what was going wrong.

The problem came to light after getting correct intermediate values using the BigInteger java package. The HLS arbitrary precision library has a fixed bitwidth (obviously, since it synthesizes down to hardware), whereas the software BigInteger libraries are flexible bit widths. It turns out that the addition operator treats both arguments as signed values if they are different bit-widths, despite the fact that I declared them as unsigned. Thus, when there was a 1 in the MSB of an intermediate value and I tried to add it to a greater value, it treated the MSB as a sign bit and attempted to sign extend it.

This did not happen with the Java BigInt library, which quickly pointed me towards the problem.

If anyone is interested in a Java implementation of modular exponentiation using the Tenca-Koc radix2 algorithm for montgomery multiplication, you can find the code here: https://github.com/bigbrett/MontModExp-radix2