Say I have two arrays np1
and np2
with different row sizes and column sizes. np1
is known to contain all the elements of np2
. Now I want to compare them row-wise, to know which rows of np1
contains the rows of np2
. More precisely, a row in np1
may contain some elements of a row in np2
. But what I need are the rows that contain all the elements of a certain row in np2
. And for all the rows in np2
, there is a row in np1
that contains all its elements.
So the input will be np1
with dimension (n1,m1)
, np2
with dimension (n2,m2)
where n1>n2
, m1=4
, and m2=3
.
Output: an array with dimension (n2*1)
. (Duplicates are possible, meaning a row in np1
may contains 2 rows in np2
.)
The data size that I am dealing with is rather big so I would like to have some efficient numpy or pandas solution.
For example:
A = [[1,2,4,3],[10,20,30,40],[100,200,28,16],[200,4,20,39]]
B = [[10,30,20],[200,28,16],[1,2,3]]
the result needs to be [1,2,0]
.
I have already tried the np.isin and np.sort to reduce the size of np1 and then compare. But it gets complicated when I want to know their original indices.
For a duplicate example, a row in np1
can contain two or more(in general two most) rows in np2
. But for a row in np2
is contained by one and only one row in np1
.
Say now
A = [[1,2,4,3],[10,20,30,40],[100,200,28,16],[200,4,20,39]]
B = [[10,30,20],[200,28,16],[20,30,40],[1,2,3]]
the result is now [1,2,1,0]
.
Here is a pure python solution, worst case is O(n²), best case is O(n):
from collections import Counter
cnt = list(map(Counter, A))
out = [next(i for i, a in enumerate(cnt) if b <= a)
for b in list(map(Counter, B))]
Output: [1, 2, 0]