Search code examples
pythonarraysperformancenumpyindices

Find indices of common values in two arrays


I'm using Python 2.7. I have two arrays, A and B. To find the indices of the elements in A that are present in B, I can do

A_inds = np.in1d(A,B)

I also want to get the indices of the elements in B that are present in A, i.e. the indices in B of the same overlapping elements I found using the above code.

Currently I am running the same line again as follows:

B_inds = np.in1d(B,A)

but this extra calculation seems like it should be unnecessary. Is there a more computationally efficient way of obtaining both A_inds and B_inds?

I am open to using either list or array methods.


Solution

  • np.unique and np.searchsorted could be used together to solve it -

    def unq_searchsorted(A,B):
    
        # Get unique elements of A and B and the indices based on the uniqueness
        unqA,idx1 = np.unique(A,return_inverse=True)
        unqB,idx2 = np.unique(B,return_inverse=True)
    
        # Create mask equivalent to np.in1d(A,B) and np.in1d(B,A) for unique elements
        mask1 = (np.searchsorted(unqB,unqA,'right') - np.searchsorted(unqB,unqA,'left'))==1
        mask2 = (np.searchsorted(unqA,unqB,'right') - np.searchsorted(unqA,unqB,'left'))==1
    
        # Map back to all non-unique indices to get equivalent of np.in1d(A,B), 
        # np.in1d(B,A) results for non-unique elements
        return mask1[idx1],mask2[idx2]
    

    Runtime tests and verify results -

    In [233]: def org_app(A,B):
         ...:     return np.in1d(A,B), np.in1d(B,A)
         ...: 
    
    In [234]: A = np.random.randint(0,10000,(10000))
         ...: B = np.random.randint(0,10000,(10000))
         ...: 
    
    In [235]: np.allclose(org_app(A,B)[0],unq_searchsorted(A,B)[0])
    Out[235]: True
    
    In [236]: np.allclose(org_app(A,B)[1],unq_searchsorted(A,B)[1])
    Out[236]: True
    
    In [237]: %timeit org_app(A,B)
    100 loops, best of 3: 7.69 ms per loop
    
    In [238]: %timeit unq_searchsorted(A,B)
    100 loops, best of 3: 5.56 ms per loop
    

    If the two input arrays are already sorted and unique, the performance boost would be substantial. Thus, the solution function would simplify to -

    def unq_searchsorted_v1(A,B):
        out1 = (np.searchsorted(B,A,'right') - np.searchsorted(B,A,'left'))==1
        out2 = (np.searchsorted(A,B,'right') - np.searchsorted(A,B,'left'))==1  
        return out1,out2
    

    Subsequent runtime tests -

    In [275]: A = np.random.randint(0,100000,(20000))
         ...: B = np.random.randint(0,100000,(20000))
         ...: A = np.unique(A)
         ...: B = np.unique(B)
         ...: 
    
    In [276]: np.allclose(org_app(A,B)[0],unq_searchsorted_v1(A,B)[0])
    Out[276]: True
    
    In [277]: np.allclose(org_app(A,B)[1],unq_searchsorted_v1(A,B)[1])
    Out[277]: True
    
    In [278]: %timeit org_app(A,B)
    100 loops, best of 3: 8.83 ms per loop
    
    In [279]: %timeit unq_searchsorted_v1(A,B)
    100 loops, best of 3: 4.94 ms per loop