Let's suppose I have these two dataframes with the same number of columns, but possibly different number of rows:
tmp = np.arange(0,12).reshape((4,3))
df = pd.DataFrame(data=tmp)
tmp2 = {'a':[3,100,101], 'b':[4,4,100], 'c':[5,100,3]}
df2 = pd.DataFrame(data=tmp2)
print(df)
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
print(df2)
a b c
0 3 4 5
1 100 4 100
2 101 100 3
I want to verify if the rows of df2
are matching any rows of df
, that is I want to obtain a series (or an array) of boolean values that gives this result:
0 True
1 False
2 False
dtype: bool
I think something like the isin
method should work, but I got this result, which results in a dataframe and is wrong:
print(df2.isin(df))
a b c
0 False False False
1 False False False
2 False False False
As a constraint, I wish to not use the merge
method, since what I am doing is in fact a check on the data before applying merge itself.
Thank you for your help!
You can use numpy.isin
, which will compare all elements in your arrays and return True
or False
for each element for each array.
Then using all()
on each array, will get your desired output as the function returns True
if all elements are true:
>>> pd.Series([m.all() for m in np.isin(df2.values,df.values)])
0 True
1 False
2 False
dtype: bool
Breakdown of what is happening:
# np.isin
>>> np.isin(df2.values,df.values)
Out[139]:
array([[ True, True, True],
[False, True, False],
[False, False, True]])
# all()
>>> [m.all() for m in np.isin(df2.values,df.values)]
Out[140]: [True, False, False]
# pd.Series()
>>> pd.Series([m.all() for m in np.isin(df2.values,df.values)])
Out[141]:
0 True
1 False
2 False
dtype: bool