I want to match a specific data entry (dictonary/dataframe with 20 variables) with a possible same entry of a database.
As there is no unique identifier and for some entries a lot of missing values I wanted to make a "naive" guess. Means I want to count all matching values row-wise and take the top 10 prospects.
Currently, I transform the dictionary to a list and use .isin()
to get the number of matching values.
db['no_matches'] = db.isin(list_of_criterias).sum(1)
prospects = db.nlargest(10 ['no_matches'])
However, my approach is misleading, as I count matches regardless of the column order/name.
Means, if my search value is e.g. column1 = 'foo'
, it also matches with 'foo'
values in my database that are not located in column1
.
Is there a way to count matching values row-wise and considering column order at the same time?
Thanks.
Update:
Thanks to the comment by Quang Hoang I passed the corresponding dictionary to the .isin()
function. However, I receive an type error
.
In[9]: type(clean_criteria)
Out[9]: dict
db.isin(clean_criteria) #Throws Error
TypeError: only list-like or dict-like objects are allowed to be passed to DataFrame.isin(), you passed a 'str'
Proposed/derived solution by comments (for community wiki):
dict_criteria = df_criteria.to_dict('list')
db['no_matches'] = db.isin(dict_criteria).sum(1)
prospects = db.nlargest(10 ['no_matches'])
Explanation