Search code examples
arraysalgorithmdata-structuresdynamic-programmingsubset-sum

Smallest number that cannot be formed from sum of numbers from array


This problem was asked to me in Amazon interview -

Given a array of positive integers, you have to find the smallest positive integer that can not be formed from the sum of numbers from array.

Example:

Array:[4 13 2 3 1]
result= 11 { Since 11 was smallest positive number which can not be formed from the given array elements }


What i did was :

  1. sorted the array
  2. calculated the prefix sum
  3. Treverse the sum array and check if next element is less than 1 greater than sum i.e. A[j]<=(sum+1). If not so then answer would be sum+1

But this was nlog(n) solution.

Interviewer was not satisfied with this and asked a solution in less than O(n log n) time.


Solution

  • Consider all integers in interval [2i .. 2i+1 - 1]. And suppose all integers below 2i can be formed from sum of numbers from given array. Also suppose that we already know C, which is sum of all numbers below 2i. If C >= 2i+1 - 1, every number in this interval may be represented as sum of given numbers. Otherwise we could check if interval [2i .. C + 1] contains any number from given array. And if there is no such number, C + 1 is what we searched for.

    Here is a sketch of an algorithm:

    1. For each input number, determine to which interval it belongs, and update corresponding sum: S[int_log(x)] += x.
    2. Compute prefix sum for array S: foreach i: C[i] = C[i-1] + S[i].
    3. Filter array C to keep only entries with values lower than next power of 2.
    4. Scan input array once more and notice which of the intervals [2i .. C + 1] contain at least one input number: i = int_log(x) - 1; B[i] |= (x <= C[i] + 1).
    5. Find first interval that is not filtered out on step #3 and corresponding element of B[] not set on step #4.

    If it is not obvious why we can apply step 3, here is the proof. Choose any number between 2i and C, then sequentially subtract from it all the numbers below 2i in decreasing order. Eventually we get either some number less than the last subtracted number or zero. If the result is zero, just add together all the subtracted numbers and we have the representation of chosen number. If the result is non-zero and less than the last subtracted number, this result is also less than 2i, so it is "representable" and none of the subtracted numbers are used for its representation. When we add these subtracted numbers back, we have the representation of chosen number. This also suggests that instead of filtering intervals one by one we could skip several intervals at once by jumping directly to int_log of C.

    Time complexity is determined by function int_log(), which is integer logarithm or index of the highest set bit in the number. If our instruction set contains integer logarithm or any its equivalent (count leading zeros, or tricks with floating point numbers), then complexity is O(n). Otherwise we could use some bit hacking to implement int_log() in O(log log U) and obtain O(n * log log U) time complexity. (Here U is largest number in the array).

    If step 1 (in addition to updating the sum) will also update minimum value in given range, step 4 is not needed anymore. We could just compare C[i] to Min[i+1]. This means we need only single pass over input array. Or we could apply this algorithm not to array but to a stream of numbers.

    Several examples:

    Input:       [ 4 13  2  3  1]    [ 1  2  3  9]    [ 1  1  2  9]
    int_log:       2  3  1  1  0       0  1  1  3       0  0  1  3
    
    int_log:     0  1  2  3          0  1  2  3       0  1  2  3
    S:           1  5  4 13          1  5  0  9       2  2  0  9
    C:           1  6 10 23          1  6  6 15       2  4  4 13
    filtered(C): n  n  n  n          n  n  n  n       n  n  n  n
    number in
    [2^i..C+1]:  2  4  -             2  -  -          2  -  -
    C+1:              11                7                5
    

    For multi-precision input numbers this approach needs O(n * log M) time and O(log M) space. Where M is largest number in the array. The same time is needed just to read all the numbers (and in the worst case we need every bit of them).

    Still this result may be improved to O(n * log R) where R is the value found by this algorithm (actually, the output-sensitive variant of it). The only modification needed for this optimization is instead of processing whole numbers at once, process them digit-by-digit: first pass processes the low order bits of each number (like bits 0..63), second pass - next bits (like 64..127), etc. We could ignore all higher-order bits after result is found. Also this decreases space requirements to O(K) numbers, where K is number of bits in machine word.