Search code examples
pythonpandasdata-manipulation

Pandas: How do I replace a subset of column values with the same column values in a specific series?


I'm writing a Pandas script to perform data manipulation on an excel file. First, I load two sheets into dataframes. One is the original data df, the second is a sheet detailing replacements that need to be made in the original data replace.

The script needs to do two things for each row of df.

  1. Replace each instance of 'Name' in df with 'NameReplace' (working)

  2. For the same rows in df, replace a slice of the columns (specified by a list) with the values in the same slice of columns in replace

Reproducible Minimal Example of my current implementation:

import pandas

df = pandas.DataFrame([["John", None, None],["Phil", None, None],["John", None, None],["Bob", None, None]], columns=["Name", "Age", "Height"])
replace = pandas.DataFrame([["John", "Dom", 25, 175],["Phil", "Kevin", 56, 145],["Bob", "Michael", 33, 180]], columns=["Name", "NameReplace", "Age", "Height"])

detailsList = ["Age", "Height"]

for i, row in replace.iterrows():
    df.loc[df['Name'] == row['Name'], 'Name'] = row['NameReplace']
    df.loc[df['Name'] == row['NameReplace'], detailsList] = row[detailsList]

print(df)

Step 1) is working with this implementation, but the detailsList columns in df do not get populated.
The current output is

      Name  Age Height
0      Dom  NaN    NaN
1    Kevin  NaN    NaN
2      Dom  NaN    NaN
3  Michael  NaN    NaN

The desired output is

      Name  Age Height
0      Dom  25    175
1    Kevin  56    145
2      Dom  25    175
3  Michael  33    180

I've been trying for a while now, and cannot seem to make progress. I also don't really get why this doesn't work, so any insight there would be extra appreciated!

Note: Using detailsList to specify the slice of columns is necessary, as in the real solution I am only operating on a specific slice of the full dataframe, unlike the example I've given.


Solution

  • The problem is the way that pandas tries to assign a series to a whole dataframe. Anyway, here's a simple fix that leads to the intended behavior, taking advantage that pandas does the correct thing when you assign with a numpy array rather than with a series.

    for i, row in replace.iterrows():
        df.loc[df['Name'] == row['Name'], 'Name'] = row['NameReplace']
        df.loc[df['Name'] == row['NameReplace'], detailsList] = row[detailsList].values
    

    Other optimizations:

    • Note that you can reuse the df['Name'] == row['Name'] mask. In particular, you save some work with
    for i, row in replace.iterrows():
        mask = df['Name'] == row['Name']
        df.loc[mask, 'Name'] = row['NameReplace']
        df.loc[mask, detailsList] = row[detailsList].values
    
    • You can avoid iterrows if you use a merge
    df = (df[['Name']].merge(replace, on = 'Name')
                      .drop(columns='Name')
                      .rename(columns={'NameReplace':'Name'}))
    

    The catch with this approach is that the rows might end up reordered.

    • Another approach, taking advantage of indexing:
    df = (replace.set_index('Name')
                 .loc[df['Name'].values]
                 .reset_index(drop=True)
                 .rename(columns={'NameReplace': 'Name'}))