Search code examples
sortingbytemegabyte

Confusion calculating size between Bytes and Megabytes


This is a solved problem where the we have to find a missing integer from the input data which is in form of a file containing 4 billion unsorted unsigned integers. The catch is that only 10 MBytes of memory can be used.

The author gives a solution where we do 2 passes. In the first pass we create an array of "blocks" of some size. Eg. Block 0 represents 0 to 999, 1 represents 1000 to 1999, so on. Thus we know how many values should be present in each block.

Now we scan through the file and count how many values are actually present in each block - which will lead us to the missing integer.

I understand the approach. However, to calculate the appropriate block size starts with this:

The array in the first pass can fit in 10 MBytes, or roughly 2^23 bytes, of memory.

How is 10 MB the same as 2^23? I am unable to understand where did the 2^23 number come from.

So 2^24 is 16MBytes, so probably he is taking 2^23 which is closest value that is 10 MBytes or less. If that is the case, why can he not use the whole 10 MBytes? i.e. why does he have to use a power of 2 to get the size here?


Solution

  • Not stated, but I'm assuming that the problem is to find the one missing integer from a file with a set of 2^32 (4294967296) unique 32 bit unsigned integers of values 0 to 4294967295. The file takes 17179869184 bytes of space.

    Using 2^23 = 8388608 of memory space allows for 2^21 = 2097152 32 bit unsigned integer counts to be held in memory. Each group represents (2^32)/(2^21) = 2^11 = 2048 values. So count[0] is for values 0 to 2047, count[1] is for values 2048 to 4095, ... , count[2097151] is for values 4294965248 to 4294967295. The main line in the inner part of the loop would be count[value>>21] += 1. All counts will be 2048 except the count corresponding to the missing integer, which will have a count of 2047. The second pass will only need to read the 2047 values in the proper range to find the missing integer.

    An alternative would be to use 4194304 16 bit counts with each group representing 1024 values (max count value is 1024), but there's no significant difference in time.

    Assuming the 10MB = 10 * 2^20 = 10485760, then 10 * 2^18 = 2621440 32 bit counts could be used, each count for a range of 1639 values (the last count for a smaller group), which is awkward. If using 16 bit counts, then a range of 3278 values, also awkward.