I am going to try to explain the problem as clearly as possible:
Can someone kindly help me check my sorting algorithm? Right now I have chosen 2^20 as the base for radix sort. On what basis do we choose the base? The length of the largest number in the array? The correct output for n = 5000000 and q = 1000000000 is 973061125 (for reference, in case anyone decides to run the program and check).
#include <iostream>
#include <algorithm>
using namespace std;

void countsort(long int* arr, long int n, long int shift)
{
    long int* count = new long int[1048576];
    for (int i = 0; i < 1048576; i++)
        count[i] = 0;
    long int* output = new long int[n];
    long int i, last;
    for (i = 0; i < n; i++)
    {
        ++count[(arr[i] >> shift) & 1048575];
    }
    for (i = last = 0; i < 1048576; i++)
    {
        last += count[i];
        count[i] = last - count[i];
    }
    for (i = 0; i < n; i++)
    {
        output[count[(arr[i] >> shift) & 1048575]++] = arr[i];
    }
    for (i = 0; i < n; i++)
    {
        arr[i] = output[i];
    }
    delete[] output;
    delete[] count;
}
int main()
{
    int trials = 0;
    cin >> trials;
    while (trials--)
    {
        long int n = 0;
        long int q = 0;
        cin >> n;
        cin >> q;
        long int first = 0, second = 1, fib = 0;
        long int* arr = new long int[n];
        arr[0] = second;
        long int m = 0;
        for (long int i = 1; i < n; i++)
        {
            fib = (first + second) % q;
            first = second;
            second = fib;
            arr[i] = fib;
            if (m < arr[i])
                m = arr[i];
        }
        // m is the largest integer in the array
        // this is where radix sort starts
        for (long int shift = 0; (m >> shift) > 0; shift += 20)
        {
            countsort(arr, n, shift);
        }
        long long int sum = 0;
        for (long int i = 0; i < n; i++)
        {
            sum = sum + ((i + 1) * arr[i]) % q;
        }
        sum = sum % q;
        cout << sum << endl;
    }
}
The infinite loop is caused by this line:
for (long int shift = 0; (m >> shift) > 0; shift += 20)
Shifting by an amount greater than or equal to the width of the type is undefined behavior in C++. Assuming this is run on an x86 processor, only the lower bits of a shift count are used: for a 32-bit integer, only the lower 5 bits (counts 0 to 31), and for a 64-bit integer, only the lower 6 bits (0 to 63). Most compilers will not compensate for this limitation, so if long int is 32 bits, m >> 40 is actually evaluated as m >> 8, which is non-zero here, and the loop condition never becomes false. (The original 8086/8088/80186 did not mask the shift count; this started with the 80286.)
The other issue, carried over from your prior question, is that (i + 1) * arr[i] can exceed 32 bits. The prior question had sum defined as long long int. The code could also define i as long long int (or use a cast before the multiply) so the product is computed in 64 bits. Fixes are noted in comments. I don't know if sum is supposed to be taken % q, so I left it as a plain long long int value.
for (int shift = 0; m > 0; shift += 20)     // fix: test m directly
{
    countsort(arr, n, shift);
    m >>= 20;                               // fix: shift by the digit width each pass
}
long long int sum = 0;                      // fix (long long)
for (long long int i = 0; i < n; i++)       // fix (long long, so the multiply is 64-bit)
{
    sum = sum + ((i + 1) * arr[i]) % q;
}
// sum = sum % q;                           // unclear whether sum should be reduced mod q
cout << sum << endl;
I don't know what compiler you are using, so I don't know whether long int is 32 bits or 64 bits. Your prior question's code declared sum as long long int, which Visual Studio uses for a 64-bit integer; I don't know about other compilers. If long int is 32 bits, then this line is a potential problem:
fib = (first + second) % q;
because first + second can overflow a 32-bit signed int and come out negative, and the sign of the remainder is the same as the sign of the dividend. Negative numbers will be an issue for the radix sort code you are using. Declaring first, second, and fib as unsigned int or as long long int avoids this.
As for choosing a base, it would probably be better to move all of the logic into countsort and rename it radix_sort. Using base 2^8 and making 4 passes would be faster (the counts / indexes then fit in L1 cache). As mentioned above, both arr and output should be declared as unsigned ints. The direction of the sort would alternate on each of the 4 passes (arr->output, output->arr, arr->output, output->arr), eliminating the need to copy.
Another optimization, for an array much larger than all of cache, is a hybrid MSD (most significant digit) / LSD (least significant digit) radix sort. Assuming base 2^8 == 256 is used, the first pass creates 256 logical bins, each of which then fits within cache, and each of the 256 logical bins is sorted using 3 LSD radix sort passes. On my system (Intel 3770K, Win 7 Pro 64-bit), this gave less than a 6% reduction in time for sorting 36 million 32-bit unsigned integers, from .37 seconds down to .35 seconds, a point of diminishing returns.