
Vectorized relative complement of sets in numpy


I have A = np.arange(n) and a NumPy array B of its non-intersecting subarrays, i.e. a partition of the initial array into k arrays of consecutive numbers.
One example would be:

A = [0, 1, 2, 3, 4, 5, 6]
B = [[0, 1], [2, 3, 4], [5, 6]]

For every subarray C of B I have to calculate A \ C (where \ is the set difference, so the result is a NumPy array of all elements of A which are not in C). My current solution hits the time limit:

import numpy as np

ans = []
for C in B:
    ans.append(np.setdiff1d(A, C))
return ans
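For the small example above, this loop produces the complement of each subarray:

```python
import numpy as np

A = np.arange(7)
B = [np.array([0, 1]), np.array([2, 3, 4]), np.array([5, 6])]

# One set difference per subarray of B.
ans = [np.setdiff1d(A, C) for C in B]
# A \ [0, 1]    -> [2, 3, 4, 5, 6]
# A \ [2, 3, 4] -> [0, 1, 5, 6]
# A \ [5, 6]    -> [0, 1, 2, 3, 4]
```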

I'd like to speed it up using vectorization, but I have no idea how. I've tried to remove the loop, leaving only functions like setxor1d and setdiff1d, but failed.


Solution

  • I assume A and the subarrays of B are sorted and have unique elements. Take, for example, 10**6 integers divided into 100 subarrays, generated by the following code:

    np.random.seed(0)
    A = np.sort(np.unique(np.random.randint(0,10**10,10**6)))
    B = np.split(A, np.sort(np.random.randint(0,10**6-1,99)))
    

    You can cut the time in half by passing assume_unique=True to np.setdiff1d. You can cut that time by a further factor of 3 by only performing the set difference on the numbers in A that lie between the smallest and biggest number of the particular subarray of B. I realize that my example is the optimal case for this optimization, so I'm not sure how well it will work on your real-world data. You will have to try.

    # First and last element of every subarray of B.
    boundaries = [x[i] for x in B for i in [0, -1]]
    # Positions of those elements in A, one (start, end) pair per subarray.
    boundary_idx = np.searchsorted(A, boundaries).reshape(-1, 2)
    # Everything outside the [start, end] window is kept as-is; only the
    # window itself needs an actual set difference.
    [np.concatenate([A[:x[0]],
                     np.setdiff1d(A[x[0]:x[1]+1], b, assume_unique=True),
                     A[x[1]+1:]])
             for b, x in zip(B, boundary_idx)]
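
    As a quick sanity check, here is a self-contained sketch of the same boundary-restricted approach on the small example from the question; it matches the plain np.setdiff1d results:

    ```python
    import numpy as np

    A = np.arange(7)
    B = np.split(A, [2, 5])  # [[0, 1], [2, 3, 4], [5, 6]]

    # Locate each subarray's first and last element inside A.
    boundaries = [x[i] for x in B for i in [0, -1]]
    boundary_idx = np.searchsorted(A, boundaries).reshape(-1, 2)

    # Set difference only inside each subarray's window of A.
    result = [np.concatenate([A[:x[0]],
                              np.setdiff1d(A[x[0]:x[1]+1], b, assume_unique=True),
                              A[x[1]+1:]])
              for b, x in zip(B, boundary_idx)]

    # Same answer as the straightforward version.
    expected = [np.setdiff1d(A, b) for b in B]
    assert all(np.array_equal(r, e) for r, e in zip(result, expected))
    ```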