Search code examples
coverflowmatrix-multiplicationinteger-overflow

How to convert float to int in C and then back after performing operations while avoiding overflow?


I am working on a project where I need to implement a Neural Network on a microcontroller in C, and execution time is critical. I am trying to try techniques to speed up the running of the code, and one thing I found works is doing integer math instead of floating point math. This worked really well in terms of speed, however, when I go and check, the actual mathematics is wrong, which is a problem.

I have MATLAB code that computes the NN using floating points, and then also computes it using integer math. I do this by scaling the input, weights and biases by a factor 2^15, and follow this procedure:

activation((scaled_input(1) * scaled_weight(1) / scale_factor) + scaled_bias(1));

This yields quite good accuracy when comparing floating to integer math. I also implemented this function (for floats) in C as follows (weights and biases are pre-scaled):

void NN_relu(int r1, int c1, float m1[][c1], int r2, int c2, const float m2[][c2], const float bias[][c2], float result[][c2]) {
    for (int i = 0; i < r1; i++) {
        for (int j = 0; j < c2; j++) {
            result[i][j] = bias[i][j];

            for (int k = 0; k < c1; k++) {
                result[i][j] += m1[i][k] * m2[k][j];
            }

            result[i][j] = (result[i][j] > 0) ? result[i][j] : 0;
        }
    }
}

This also works well, getting the exact same value as MATLAB does.

However, my implementation of the scaling does not work. It results in the following outputs for the first layer:

69259 0 0 24448 34106 64463 71738 0 55807 0

as opposed to the correct scaled output from MATLAB:

69259   0   0   24448   0   0   596026  0   0   0

Here is the scaled C function in question

void NN_relu(int r1, int c1, int m1[][c1], int r2, int c2, const int m2[][c2], const int bias[][c2], int result[][c2]) {
    for (int i = 0; i < r1; i++) {
        for (int j = 0; j < c2; j++) {
            result[i][j] = bias[i][j];

            for (int k = 0; k < c1; k++) {
                result[i][j] += (m1[i][k] * m2[k][j]) / S; // S = 2^15
            }

            result[i][j] = (result[i][j] > 0) ? result[i][j] : 0;
        }
    }
}

I have deduced the reason for the error in some but not all elements of the result is due to overflow, I have confirmed this by debugging using print statements, ie, for correct values I get:

m1[0][0]: 32768, m2[0][0]: 4752, product: 155713536, scaled_product: 4752

And for incorrect ones I get:

m1[0][0]: 32768, m2[0][4]: 101526, product: -968163328, scaled_product: -29546

As you can see, the scaled product should just be 101526 again, but it isn't because of integer overflow.

I have tried using 2^31 or other scaling factors, but my MATLAB code gives horrible accuracies for those, like several orders of magnitude off.

Is there any alternative method to handling this? Or should I try another approach to speeding up my code execution?


Solution

  • The particular approach you describe for using integers to perform non-integer arithmetic is called fixed-point arithmetic.

    When considering or using fixed-point arithmetic, the essential first questions are what range of values (maximum and minimum) you need to support and what precision (number of fractional digits) you require. These are first and foremost a question of your data and expected computational results, including intermediate results, not a matter of the details of your algorithm. Your scale factor of 215 corresponds to 15 binary digits of fraction, or about 4-5 decimal digits. If you need to be more precise than that then you need a larger scale, but that would reduce the numeric range of values you can represent with the same total number of bits. If you can afford to be less precise then a smaller scale factor will give you more range.

    Supposing that 32-bit int (as it appears yours are) with 15 bits designated for a fractional part is well matched to your data and all computational results, you're in fairly good shape. The maximum number of significant digits in the product mn is the sum of the number of significant digits in m and n, so to avoid integer overflow in your fixed-point multiplication, it would suffice to coerce the operands to 64 bits, perform a 64-bit multiplication, and then scale down the result. For example:

                    result[i][j] += ((long long int) m1[i][k] * m2[k][j]) / S;
    

    That the final conversion back to type int will never overflow is one of the facets of the chosen fixed-point representation being suitable for your data. That is, you need to take it into consideration in choosing the underlying integer type and scale.

    Additional notes:

    • Converting the underlying integer type to a wider one instead of performing just the multiplication with a wider type merely kicks the can down the road. It might work for you, in your particular computation, but only in the sense that your particular numbers don't trigger an overflow. It does not remove the risk of an overflow.

    • That might be ok. However, you may find that the narrower type is faster, because it uses half as much memory, and moving data from memory to CPU and back is costly enough to care about. If your matrices are large, and if int suffices but for the one fixed-point multiply you asked about, then consider sticking with int as the underlying type, and widen only the transient intermediate result of the integer multiplication.

    • I used long long int instead of long int because the former is guaranteed to have at least 64 bits, but the latter is not. In a case such as this, where the sizes matter, you should consider using the explicit-width integer types from stdint.h. In particular, int32_t and / or int64_t.