Search code examples
pythonpandasdataframestring-matching

Full string matching in Pandas dataframes comparison


this seems like it should be an easy problem to solve, but I've been battling with it and cannot seem to find a solution.

I have two dataframes of different sizes and different column names. I am trying to compare a column in dataframe A, with another in dataframe B, and only retain the rows in dataframe A, if the strings are an EXACT match with B. I am not looking for a partial match or sub string, but an EXACT FULL STRING match. I have checked close to 40 questions now, and still keep getting partial matches.

Dataframe A
   ID                 Text     score
0   1                admin      10.4
1   2                Care        2.0
2   3                Office      5.0
3   4                Doctor      0.5
4   5                be          0.2
5   6                to          0.9
6   7                in          0.8 

And lets assume that the second dataframe is:

Dataframe B
   ID                 Labels              Places
0   1                Office                Upper
1   2                administration       Lower
2   3                Doctor               Internal
3   4                Reception           Outer
4   5                Tone                 Outer
5   6                Skin                 Internal
6   7                Behaviour            Outer
7   8                Injury               Upper

My desired output is

Dataframe A
   ID                 Text     score
2   3               Office       5.0
3   4               Doctor       0.5

This will be from comparing DataframeA['Text'] with DataframeB['Labels'] and only keeping the exact matches.

I have tried

df_A_new = df_A[df_A['Text'].isin(df_b['Labels'])]

and the output it gives me is

   ID                 Text     score
0   1                admin      10.4
2   3               Office       5.0
3   4               Doctor       0.5
4   5                be          0.2
5   6                to          0.9
6   7                in          0.8 

it maps the substring admin and administration. Merge has not helped either.

df_a_new = df_a.merge(df_b, left_on='Text', right_on='Lables')[['ID', 'Text', 'score']]

I have checked so many answers on stackoverflow, but they all seem to be matching substrings! Can anyone help with this?


Solution

  • Here is my suggestion.

    Ensure string consistency

    • Remove all hidden characters like extra spaces, tabs, or non-printable characters.
    • Case sensitivity can be another issue. If the match should be case-insensitive, we can normalize all strings to lowercase.

    Use merge() with strict matching

    • Instead of isin(), use merge() which performs a more explicit join between DataFrames and guarantees exact matches.

    Optionally check string length

    • If substrings like be are sneaking through, we can ensure that the strings have the exact same length as an additional check.

    Here is the code.

    import pandas as pd
    
    df_A = pd.DataFrame({
        'ID': [1, 2, 3, 4],
        'Text': ['admin', 'Care', 'Office', 'Doctor'],
        'score': [10.4, 2.0, 5.0, 0.5]
    })
    
    df_B = pd.DataFrame({
        'ID': [1, 2, 3, 4],
        'Labels': ['Office', 'administration', 'Doctor', 'Reception'],
        'Places': ['Upper', 'Lower', 'Internal', 'Outer']
    })
    
    # Strip all leading/trailing spaces and normalize to lowercase (or keep case-sensitive if needed)
    df_A['Text'] = df_A['Text'].str.strip().str.lower()
    df_B['Labels'] = df_B['Labels'].str.strip().str.lower()
    
    # Ensure exact length matching (optional, if substring matches are suspected)
    df_A_new = df_A[df_A['Text'].apply(len) == df_A['Text'].apply(len)]
    
    # Use merge() for strict exact matching
    df_A_new = df_A.merge(df_B, left_on='Text', right_on='Labels', how='inner')[['ID_x', 'Text', 'score']]
    
    # Optionally rename ID column
    df_A_new.rename(columns={'ID_x': 'ID'}, inplace=True)
    
    # Display the result
    print(df_A_new)
    

    I hope this will help you a little.