Search code examples
pythonpandasnumpypython-3.9

Find indexes of rows from an array that contains the rows of another array


Say I have two arrays np1 and np2 with different row sizes and column sizes. np1 is known to contain all the elements of np2. Now I want to compare them row-wise, to know which rows of np1 contains the rows of np2. More precisely, a row in np1 may contain some elements of a row in np2. But what I need are the rows that contain all the elements of a certain row in np2. And for all the rows in np2, there is a row in np1 that contains all its elements.

So the input will be np1 with dimension (n1,m1), np2 with dimension (n2,m2) where n1>n2, m1=4, and m2=3.

Output: an array with dimension (n2*1). (Duplicates are possible, meaning a row in np1 may contains 2 rows in np2.)

The data size that I am dealing with is rather big so I would like to have some efficient numpy or pandas solution.

For example:

A = [[1,2,4,3],[10,20,30,40],[100,200,28,16],[200,4,20,39]]

B = [[10,30,20],[200,28,16],[1,2,3]]

the result needs to be [1,2,0].

I have already tried the np.isin and np.sort to reduce the size of np1 and then compare. But it gets complicated when I want to know their original indices.

For a duplicate example, a row in np1 can contain two or more(in general two most) rows in np2. But for a row in np2 is contained by one and only one row in np1.

Say now

A = [[1,2,4,3],[10,20,30,40],[100,200,28,16],[200,4,20,39]]

B = [[10,30,20],[200,28,16],[20,30,40],[1,2,3]]

the result is now [1,2,1,0].


Solution

  • Here is a pure python solution, worst case is O(n²), best case is O(n):

    from collections import Counter
    
    cnt = list(map(Counter, A))
    
    out = [next(i for i, a in enumerate(cnt) if b <= a)
           for b in list(map(Counter, B))]
    

    Output: [1, 2, 0]