Is it possible to do a counting sort in parallel and achieve O(n/p) runtime?
Take an example where we have an array with millions of elements that range from 1 to 10. A merge sort will run in no better than O(n log n) time, while a counting sort applied to this problem will run in O(n) time. Parallelizing the counting sort could be interesting. If we assign a subarray of n/p elements to each processor, and each processor has its own count array of size 10, the initial step where element counts are accumulated should take O(n/p) time. Consolidating all the count arrays into a single array should take O(p) time, as you are only iterating over p count arrays, each of constant size.
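For concreteness, the counting step I have in mind looks something like this (a rough C++ sketch; parallel_count, the fixed 1-10 value range, and the thread-per-chunk layout are just illustrative assumptions):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Each of p threads counts its own n/p chunk into a private count
// array; the p arrays are then merged serially (p arrays x 10 slots).
std::vector<long> parallel_count(const std::vector<int>& a, int p) {
    std::vector<std::vector<long>> local(p, std::vector<long>(10, 0));
    std::vector<std::thread> threads;
    const std::size_t chunk = (a.size() + p - 1) / p;
    for (int t = 0; t < p; ++t) {
        threads.emplace_back([&, t] {
            std::size_t begin = t * chunk;
            std::size_t end = std::min(a.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                ++local[t][a[i] - 1];          // values 1..10 -> slots 0..9
        });
    }
    for (auto& th : threads) th.join();

    std::vector<long> total(10, 0);            // O(p) merge: p arrays of
    for (int t = 0; t < p; ++t)                // constant size 10
        for (int v = 0; v < 10; ++v)
            total[v] += local[t][v];
    return total;
}
```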
I haven't been able to fully think through the last step of the counting sort, where the elements are placed in order. If the entries of the count array are treated as atomic write cursors, you could assign n/p-element sections of the original array to individual processors and achieve some parallelization, but there would be contention at individual entries of the count array, potentially reducing parallelization substantially. If the input array is all 10s, every processor would serialize on element 9 of the count array, reducing algorithmic efficiency to O(n).
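To make the contention concrete, here is roughly what I mean (a hypothetical sketch; out must already be sized n, and next holds one atomic write cursor per value, pre-initialized to each value's first index in the output):

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Each thread scans its own n/p slice of the input and claims an
// output slot with fetch_add. Correct, but a skewed input (e.g. all
// 10s) makes every thread hammer the same counter and serialize.
void contended_place(const std::vector<int>& a, std::vector<int>& out,
                     std::vector<std::atomic<long>>& next, int p) {
    std::vector<std::thread> threads;
    const std::size_t chunk = (a.size() + p - 1) / p;
    for (int t = 0; t < p; ++t) {
        threads.emplace_back([&, t] {
            std::size_t begin = t * chunk;
            std::size_t end = std::min(a.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i) {
                long pos = next[a[i] - 1].fetch_add(1); // contention point
                out[pos] = a[i];
            }
        });
    }
    for (auto& th : threads) th.join();
}
```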
You could assign subarrays of the count array to each of p processors and get back to O(n/p) runtime, but only if the elements are distributed fairly evenly. And, in our example, you would be limited to 10 processors. If the elements are not distributed evenly, one or more processors could be doing a disproportionate share of the work. For instance, if half of the elements in the input array are 10s, one processor would have to step through half the array. In the worst case, the array is all 10s and a single processor would have to step through the entire array, devolving the runtime to O(n).
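That value-partitioned placement might look like this (hypothetical names; count holds the merged counts and start the first output index of each value, i.e. the exclusive prefix sums of count):

```cpp
#include <thread>
#include <vector>

// One thread per distinct value: thread v writes the whole run of
// (v+1)s. At most 10 threads are usable, and a skewed distribution
// leaves one thread writing most of the output on its own.
void fill_by_value(std::vector<int>& out, const std::vector<long>& count,
                   const std::vector<long>& start) {
    std::vector<std::thread> threads;
    for (int v = 0; v < 10; ++v) {
        threads.emplace_back([&, v] {
            for (long i = 0; i < count[v]; ++i)
                out[start[v] + i] = v + 1;   // run of count[v] copies
        });
    }
    for (auto& th : threads) th.join();
}
```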
Maybe you could divide individual entries of the count array among multiple processors. For instance, if there are 50 10s in the input array, element 9 of the count array would reflect this, and you could have 5 processors each write 10 of those 10s to the proper positions in the output array. This again devolves to O(n) runtime if there are fewer than p elements at each index of the count array, but it avoids the problem of unevenly distributed element values.
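Splitting one value's run could be sketched like this (again hypothetical; start is the value's first output index and count its total, both taken from the merged count array):

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Divide the count copies of one value among p threads, each writing
// an equal share of the run. With 50 tens and p = 5, each thread
// writes 10 of them.
void fill_one_value(std::vector<int>& out, long start, long count,
                    int value, int p) {
    std::vector<std::thread> threads;
    const long share = (count + p - 1) / p;
    for (int t = 0; t < p; ++t) {
        threads.emplace_back([&, t] {
            long begin = std::min(count, t * share);
            long end = std::min(count, begin + share);
            for (long i = begin; i < end; ++i)
                out[start + i] = value;
        });
    }
    for (auto& th : threads) th.join();
}
```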
Is it possible to do a counting sort in O(n/p) time?
Yes, it is possible. Divide your array into p parts of equal length. Then create a counting array c for each process. Let each process count the elements of its part and store the counts in c; this will take O(n/p). Now add all the counting arrays c together and make the combined array shared to all processes; this will take O(p*b), where b is the number of possible values. So far this is exactly your approach.
Now you can recreate the array in p processes, since you can calculate the first and last index of every value from c. For each value i, its first index is the sum of the counts of all smaller values in c; its last index is its first index plus c[i]. This calculation can be done in O(i), where i is smaller than b, so it is less than O(b). Each process can now repopulate its own part of the output, as sketched below. This again takes O(n/p).
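A minimal sketch of that repopulation phase (C++ with std::thread; parallel_fill and its parameter names are hypothetical, and it assumes values 1..b, the merged count array c from the first phase, and an output array already sized n):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// From the merged counts, build each value's first output index
// (prefix sums, O(b)), then let each of p threads fill its own n/p
// slice of the output with the value runs that overlap that slice.
void parallel_fill(std::vector<int>& out, const std::vector<long>& c, int p) {
    const std::size_t b = c.size();
    std::vector<long> first(b + 1, 0);
    for (std::size_t v = 0; v < b; ++v)
        first[v + 1] = first[v] + c[v];    // first[v] = start of value v+1's run

    std::vector<std::thread> threads;
    const long n = first[b];
    const long chunk = (n + p - 1) / p;
    for (int t = 0; t < p; ++t) {
        threads.emplace_back([&, t] {
            long begin = std::min(n, t * chunk);
            long end = std::min(n, begin + chunk);
            std::size_t v = 0;             // find the run covering begin: O(b)
            while (v < b && first[v + 1] <= begin) ++v;
            for (long i = begin; i < end; ++i) {
                while (first[v + 1] <= i) ++v; // step into the next run
                out[i] = static_cast<int>(v + 1);
            }
        });
    }
    for (auto& th : threads) th.join();
}
```

Each thread writes only to its own slice of the output, so there is no contention however skewed the input is; the O(b) run search per thread is the b term in the sum below.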
To sum it all up, you have n/p + p*b + b + n/p. If p*b << n, this results in O(2*n/p). (Since 2/p is a constant factor, you are still in the class O(n); but the parallelisation will significantly speed up your algorithm.)
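To put rough numbers on the example above: with n = 10^7 elements, b = 10 values, and p = 8 processors, the sum is about 2*(10^7/8) + 8*10 + 10 ≈ 2.5 million operations, so the p*b and b terms vanish next to the two n/p passes.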