Search code examples
algorithmnumbersnumericnumerical-methodsdivide-and-conquer

Given a file containing 4.30 billion 32-bit integers, how can we find a number, which has appeared at least twice?


I have come up with divide and conquer algorithm for this. Just wanted to know if this would work or not?

First mid is calculated from the integer range i.e. (0+(1<<32-1))>>1 and then this idea is applied: range of number from start to mid or from mid to end will always be less than the numbers we are going to consider as we are considering billion numbers and there will definitely some numbers which are going to be repeated as the range of 32bit integer is much smaller compare to billion numbers.

def get_duplicate(input, start, end):  
  while True:
    mid = (start >> 1) + end - (end >> 1)
    less_to_mid = 0
    more_to_mid = 0
    equal_to_mid = 0
    for data in input:
        data = int(data, 16)
        if data < mid:
            less_to_mid += 1
        elif data == mid:
            equal_to_mid += 1
        else:
            more_to_mid += 1
    if equal_to_mid > 1:
        return mid
    elif mid-start < less_to_mid:
        end = mid-1
    elif end-mid < more_to_mid:
        start = mid+1

with open("codes\output.txt", 'r+') as f:
  content = f.read().split()
  print(get_duplicate(content, 0, 1<<32-1))

I know we can use bit array but I just want to get your views on this solution and if implementation is buggy.


Solution

  • Your method is OK. But you will probably need to read the input many times to find the answer.

    Here is a variant, which allows you to find a duplicate with few memory, but you only need to read the input twice.

    1. Initialize an array A[65536] of integers to zero.
    2. Read the numbers one by one. Every time a number x is read, add 1 to A[x mod 65536].
    3. When the reading ends, there will be at least one i such that A[i] is strictly bigger than 65536. This is because 65536 * 63356 < 4.3 billion. Let us say A[i0] is bigger than 65536.
    4. Clear the array A to zero.
    5. Read the numbers again, but this time, only look at those numbers x such that x mod 65536 = i0. For every such x, add 1 to A[x / 65536].
    6. When the reading ends, there will be at least one j such that A[j] is strictly bigger than 1. Then the number 65536 * j + i0 is the final answer.