Search code examples
algorithmfrequency

A large file containing 1 million integers, what would be the fastest way to find the most occurring?


Basic approach would be to use an array or a hashmap to create a historgram of numbers and select the most frequent.

In this case let's assume that all the numbers from the file cannot be loaded into the main memory.

One way I can think of is to sort using external merge/quick sort and then chunk by chunk calculate the frequency. As they are sorted, we don't have to worry about the number appearing again after the sequence with a number finishes.

Is there a better and more efficient way to do this?


Solution

  • Well, a million isn't so much anymore, so lets assume we're talking about several billion integers.

    In that case, I would suggest that you hash them and partition them into 2^N buckets (separate files or preallocated parts of the same file) using the top N bits of their hash values.

    You would choose N so that the resulting buckets were highly likely to be small enough to process in memory.

    You would then process each bucket by counting the occurrences of each unique value in a hash table or similar.

    In the unlikely event that a bucket has too many unique values to fit in RAM, repartition using the next N bits of the hash and try again.