Search code examples
pythonpandaspandas-merge

Pandas DataFrame Merge Not Updating All Rows as Expected


I'm trying to merge two dataframes in Pandas based on a common column, ID. After merging, I want to update a column, Target Value, in my original dataframe (df) using a column from the second dataframe (reference). However, not all rows are being updated as expected, even though the ID values are exact match. For example, it would update only 4 out of 10 rows I have in my dataframe.

This is my original dataframe (df):

        ID  Other Column  Target Value
0  ID00123       Value A          NaN
1  ID00456       Value B          NaN
2  ID00789       Value C          NaN
3  ID01012       Value D          NaN
4  ID01345       Value E          NaN

This is my reference dataframe (reference):

        ID  Reference Value
0  ID00123         RefVal1
1  ID00456         RefVal2
2  ID00789         RefVal3
3  ID01012         RefVal4
4  ID01345         RefVal5

This is my last attempt of merging two dataframes and updating the original dataframe:

# Merge dataframes on 'ID'
merged = df.merge(reference, on='ID', how='left', indicator=True)

# Create a mask for rows to update
mask = (
    (merged['_merge'] == 'both') &  # Exists in both DataFrames
    (df['Target Value'].isna())    # 'Target Value' is NaN
)

# Update 'Target Value' in df
df.loc[mask, 'Target Value'] = merged.loc[mask, 'Reference Value']

This is how merged dataframe looks like:

        ID  Other Column  Target Value Reference Value _merge
0  ID00123       Value A          NaN         RefVal1   both
1  ID00456       Value B          NaN         RefVal2   both
2  ID00789       Value C          NaN         RefVal3   both
3  ID01012       Value D          NaN         RefVal4   both
4  ID01345       Value E          NaN         RefVal5   both

And this is how my original dataframe (df) looks like after 'updating' it:

        ID  Other Column  Target Value
0  ID00123       Value A       RefVal1
1  ID00456       Value B       RefVal2
2  ID00789       Value C          NaN  # This row remains unchanged
3  ID01012       Value D          NaN  # This row remains unchanged
4  ID01345       Value E          NaN  # This row remains unchanged

I have tried inner join as well, still only 4 out of 10 Target Values would be updated.

I have checked datatypes of columns, both were Object.

Stripped leading/trailing whitespace and converted all values to lowercase:

df['ID'] = df['ID'].str.strip().str.lower()
reference['ID'] = reference['ID'].str.strip().str.lower()

Checked for unmatched values:

unmatched_df = set(df['ID']).difference(reference['ID'])
unmatched_reference = set(reference['ID']).difference(df['ID'])
print(unmatched_df, unmatched_reference)  # Both sets are empty

Furthermore, there are no duplicate rows: there are literally only 10 rows in the entire dataframe.

This is not the first time I encountered this issue, but I have managed to work around it. The same issues. It matches everything - merge table gives me all matched rows, however, it updates only part of the matched rows. (For some reason, once it was not working on the deduplicated table, but it worked fine on the table with duplicated - and I am talking about table with over 50k records).

Thank you for help!


Solution

  • The index is not preserved during a merge. Thus if your original df doesn't have a RangeIndex, assigning values from merged_df won't align the data properly.

    You could reset_index before the merge, and set_index before assignment:

    # Merge dataframes on 'ID'
    merged = df.reset_index().merge(reference, on='ID', how='left', indicator=True)
    
    # Create a mask for rows to update
    mask = (
        (merged['_merge'] == 'both') &  # Exists in both DataFrames
        (df['Target Value'].isna())    # 'Target Value' is NaN
    ).to_numpy()
    
    # Update 'Target Value' in df
    df.loc[mask, 'Target Value'] = merged.set_index('index').loc[mask, 'Reference Value']
    

    Example output:

            ID Other Column Target Value
    0  ID00123      Value A      RefVal1
    1  ID00456      Value B      RefVal2
    2  ID00789      Value C      RefVal3
    3  ID01012      Value D      RefVal4
    4  ID01345      Value E      RefVal5