Search code examples
pythonpandasdataframematchsimilarity

Count matching values between dictionary and dataframe considering keys / column order in python


I want to match a specific data entry (dictonary/dataframe with 20 variables) with a possible same entry of a database.

As there is no unique identifier and for some entries a lot of missing values I wanted to make a "naive" guess. Means I want to count all matching values row-wise and take the top 10 prospects.

Currently, I transform the dictionary to a list and use .isin() to get the number of matching values.

db['no_matches'] = db.isin(list_of_criterias).sum(1)
prospects = db.nlargest(10 ['no_matches'])

However, my approach is misleading, as I count matches regardless of the column order/name.

Means, if my search value is e.g. column1 = 'foo', it also matches with 'foo' values in my database that are not located in column1.

Is there a way to count matching values row-wise and considering column order at the same time?

Thanks.

Update:

Thanks to the comment by Quang Hoang I passed the corresponding dictionary to the .isin() function. However, I receive an type error.

In[9]: type(clean_criteria)
Out[9]: dict

db.isin(clean_criteria) #Throws Error

TypeError: only list-like or dict-like objects are allowed to be passed to DataFrame.isin(), you passed a 'str'

Solution

  • Proposed/derived solution by comments (for community wiki):

    dict_criteria = df_criteria.to_dict('list') 
    
    db['no_matches'] = db.isin(dict_criteria).sum(1)  
    prospects = db.nlargest(10 ['no_matches'])
    

    Explanation

    • .to_dict('list') -- 'list' argument transforms dict values from skalar to list/array object
    • .isin() -- passing a 'list' is matching any value not considering order while passing a dictionary does