Search code examples
algorithmsortingbucket-sort

Near perfect distribution model for Bucket Sort


I was trying to understand the algorithm for bucket sort and it occurred to me that without the right distribution model, we can get a complexity of O(n^2). Quite a few websites have the number of buckets equal to the size of the array (say 'n') and use the algorithm

std::vector<float> bucket[n];
for (int i = 0; i<n; i++){
  bucket[(array[i]*n)/(MAX_ELEMENT_IN_INPUT_ARRAY+1)].push_back(array[i]);
}

I understand that integers can be random and there's no perfect hashing algorithm but I don't quite get how the above algorithm can equal distribute the elements into their respective buckets. Is there a straight forward logic I'm missing out on?


Solution

  • The above code does not guarantee an even distribution. As an example, suppose that you have an input array consisting of n elements, the numbers 1, 2, 4, 8, 16, 32, ..., 2n-1. Now, let's think about where these elements will end up. Let's pick one element, say, 2k. Its bucket index is given by

    2k · n / (2n-1 + 1)

    The cause for alarm here is that 1 / (2n - 1) is a very, very small number compared to n. As a result, we'd expect that most of the elements are going to get dropped into very low bucket numbers and that we'll have bad dispersion.

    Let's try this out on 1, 2, 4, 8, 16, 32, 64, 128. We're going to have 8 total buckets. The elements get mapped as follows:

    • 1 gets put into bucket 1 * 8 / 129 = 8 / 129 = 0
    • 2 gets put into bucket 2 * 8 / 129 = 16 / 129 = 0
    • 4 gets put into bucket 4 * 8 / 129 = 32 / 129 = 0
    • 8 gets put into bucket 8 * 8 / 129 = 64 / 129 = 0
    • 16 gets put into bucket 16 * 8 / 129 = 128 / 129 = 0
    • 32 gets put into bucket 32 * 8 / 129 = 256 / 129 = 1
    • 64 gets put into bucket 64 * 8 / 129 = 512 / 129 = 3
    • 128 gets put into bucket 128 * 8 / 129 = 1024 / 129 = 7

    As you can see, five of the eight elements here got dropped into bucket 0, and most of the buckets are unused.

    More generally, if you have n elements with this sequence, then only the buckets n - 1, (n - 1) / 2, (n - 1) / 4, (n - 1) / 8, etc. will ever get used. There are only about log n buckets of this form, meaning that about n - log n elements will get dropped into bucket 0 and only about log n elements will be in other buckets.

    To the best of my knowledge, there is no one formula that will always give you a good distribution. The formula given here works well if you assume that the numbers are uniformly distributed over an interval, and as you can see, if you give exponentially distributed numbers, you end up with a pretty bad worst-case behavior.