Search code examples
cperformancefor-loopiteration

Do for loops get slow the longer it iterates? Why are these two for-loop performances different?


I am currently creating a BIG_INTEGER struct of my own in pure C as a pastime project. One of the functions that I decided to implement into my project is a PRNG that returns a random BIG_INTEGER with a specified number of digits.

This is the BIG_INTEGER struct:

typedef struct BIG_INTEGER
{
    BOOL bHasSign, bInit;
    SIZE_T szLength;
    INT* arrnDigits;
} BIG_INTEGER, *PBIG;

This is the code I came up for the PRNG.

VOID birndl(PBIG pbigOut, ULONG szLength)
{
    if (pbigOut->bInit != 1)
    {
        // Initialize the input with binew() first.
        fprintf(stderr, "birndl!1: no content");
        abort();
    }
    else
        // Simply resets the values and frees the int array
        bidel(pbigOut);
    // Prepare
    pbigOut->bInit = TRUE;
    pbigOut->bHasSign = FALSE;
    pbigOut->szLength = szLength;
    pbigOut->arrnDigits = malloc(szLength * sizeof(INT));
    if (pbigOut->arrnDigits != NULL)
    {
        NTSTATUS ntErr;
        UCHAR buf[512];
        SIZE_T index1, index2, index3, index4;
        // Fill buffer with random bytes
        ntErr = BCryptGenRandom(NULL, buf, 512, BCRYPT_USE_SYSTEM_PREFERRED_RNG);
        if (!NT_SUCCESS(ntErr))
        {
            fprintf(stderr, "birndl!2: bcrypt failed with code %ld", ntErr);
            abort();
        }
        index1 = index2 = index3 = index4 = 0;
        *(pbigOut->arrnDigits) = (buf[index1] * buf[index2] * buf[index3] * buf[index4]) % 10;
        // Zero cannot be the first digit, since the resulting length would be szLength - 1
        // Slightly biased, but the show must go on; transfer the chances of getting a #0 to #5
        if (*(pbigOut->arrnDigits) == 0)
            *(pbigOut->arrnDigits) = 5;
        // By multiplying 4 selected UCHAR in the buffer mod 10, a digit from 0-9 is obtained
        for (SIZE_T _iter = 1; _iter < szLength; ++_iter)
        {
            // Select a different UCHAR from the buffer for every iteration
            if (index1 == 512 && index2 < 512)
            {
                index2++;
                index1 = 0;
            }
            else if (index1 == 512 && index2 == 512 && index3 < 512)
            {
                index3++;
                index2 = 0;
            }
            else if (index1 == 512 && index2 == 512 && index3 == 512 && index4 < 512)
            {
                index4++;
                index3 = 0;
            }
            else if (index1 == 512 && index2 == 512 && index3 == 512 && index4 == 512)
            {
                fprintf(stderr, "birndl!3: exceeding capability");
                abort();
            }
            else
                index1++;
            *(pbigOut->arrnDigits + _iter) = (buf[index1] * buf[index2] * buf[index3] * buf[index4]) % 10;
        }
    }
    else
    {
        fprintf(stderr, "birndl!4: memory error");
        abort();
    }
}

For a random thought, I wanted to test the performance about how fast can it generate digits, so what I did is pretty straightforward.

LARGE_INTEGER frequency, start, end;
QueryPerformanceFrequency(&frequency);
QueryPerformanceCounter(&start);
ULONGLONG length = 10000000;
BIG_INTEGER num;
binew(NULL, &num);
birndl(&num, length);
PCHAR str = bistr(&num);
bidel(&num);
free(str);
QueryPerformanceCounter(&end);
printf("QPC: Program took %fs to finish. (digits=%I64u)\n",
    (double)((end.QuadPart - start.QuadPart)) / frequency.QuadPart, length);
_CrtDumpMemoryLeaks();

Note that I tried this with other powers of 10 lower than 10M, too. This is the summary of its performance with the same troubleshooting method, except I changed the value of length accordingly.

1-10 K digits: instantly
100 K digits: 0.5~0.6 seconds
1 M digits: 47~53 seconds
2 M digits: ~197 seconds
10 M digits: >10 minutes and still not finished

With common sense, who generates 10M digits, right? However, I wanted to know if there is a better, faster way to generate such big amounts. I thought, maybe if the for-loop decays in speed as the iteration progresses, then maybe I can split the values first, pass it through birndl with lesser iterations, convert those BIG_INTEGER results into a string, and finally concatenate them into one big string. Here's the entry point code along with bistr, which converts my BIG_INTEGER into a string.

INT main(VOID)
{
    LARGE_INTEGER frequency, start, end;
    QueryPerformanceFrequency(&frequency);
    QueryPerformanceCounter(&start);
    // Compute necessary values
    ULONGLONG aim_digits = 10000000;
    UINT divisions = 2500;
    ULONGLONG per_division = aim_digits / divisions;
    // Create an array of BIG_INTEGER values
    BIG_INTEGER* list = malloc(sizeof(BIG_INTEGER) * divisions);
    // Store the string forms of the BIG_INTEGER values here
    // Last element (hence +1) would contain the concatenation of all strings in the array
    PCHAR* strlist = malloc(sizeof(PCHAR*) * (divisions + 1));
    assert(list != NULL && strlist != NULL);
    // Iterate for each division
    for (SIZE_T i = 0; i < divisions; ++i)
    {
        binew(NULL, &list[i]);
        birndl(&list[i], per_division);
        // Convert generated portion into a string
        strlist[i] = bistr(&list[i]);
    }
    // Allocate
    strlist[divisions] = malloc(aim_digits + 1);
    assert(strlist[divisions] != NULL);
    // Start concatenating everything
    strcpy(strlist[divisions], strlist[0]);
    for (SIZE_T i = 1; i < divisions; ++i)
        strcat(strlist[divisions], strlist[i]);
    // Make sure to free all the used memory by the end
    for (SIZE_T i = 0; i < divisions; ++i)
    {
        bidel(&list[i]);
        free(strlist[i]);
    }
    free(strlist[divisions]);
    free(strlist);
    free(list);
    QueryPerformanceCounter(&end);
    printf("QPC: Program took %fs to finish. (digits=%I64u)\n",
        (double)((end.QuadPart - start.QuadPart)) / frequency.QuadPart, aim_digits);
    _CrtDumpMemoryLeaks();
    return 0;
}

PCHAR bistr(PBIG pbigIn)
{
    if (pbigIn->bInit != 1)
    {
        fprintf(stderr, "bistr!1: no content");
        abort();
    }
    PCHAR pstrOut = calloc(pbigIn->szLength + pbigIn->bHasSign + 1, sizeof(CHAR));
    if (!pstrOut)
    {
        fprintf(stderr, "bistr!2: memory error");
        abort();
    }
    if (pbigIn->bHasSign)
        strcat(pstrOut, "-");
    for (SIZE_T i = 0; i < pbigIn->szLength; ++i)
    {
        CHAR digit[2];
        snprintf(digit, 2, "%d", pbigIn->arrnDigits[i]);
        strcat(pstrOut, digit);
    }
    return pstrOut;
}

I did a test run "logarithmically" with around 4000 digits per division, and this gave extremely amazing results!

d - digits; v - divisions
10d,5v:        }
100d,25v:      }
1Kd,25v:       } instantly
10Kd,25v:      }
100Kd,25v:     }
1Md,250v: 0.5~0.6 seconds
10Md,2500v: 6.6~7.0 seconds
100Md,25000v: ~197 seconds

This is definitely a lot faster than the straightforward way of using the PRNG. It generates 100 million digits within same time as the first method took when it tried to generate 2 million!

However, as the title suggests, I also wonder why birndl causes the for-loop to deteriorate in speed over time. Is there a way to optimize it so that it performs the same or at least nearly the same way as the "split method", or is it just the nature of for-loops to slow down over time as it iterates that one should keep the number of iterations low to maintain efficiency?

Why would the second method be faster, which uses birndl many times among other extra steps (bistr, strcat) but with less iterations passed, compared to the first method which simply tells birndl to generate a very big amount of digits in one go?

Thank you!


Solution

  • For loops do not inherently get slower over time.

    I can't be certain what is really happening on your machine but with that shape of workload your measurements do not surprise me. At low enough counts all the memory required will be hitting cache lines within the CPU which is extremely fast. As the count increases you start hitting system RAM which is considerably slower and (because the code & data structure you're using are not designed to mitigate this that I can see) for large enough counts you will run into cache collisions and prefetch/expiry misprediction which will be a further step slower. Beyond that you may start using virtual memory that's backed by something other than system RAM which will be another step slower.

    Some other hints:

    1. Try using half a byte for each decimal digit rather than using an int, you should find that uses less memory and runs faster.
    2. Your PRNG won't produce evenly-distributed results. An example of a fairer and faster algorithm would take 4 random bits from your buffer, use that as the digit value if it's < 10, and discard it otherwise. Additionally with either algorithm you should call BCryptGenRandom again when you exhaust the buffer.