Search code examples
pandasdataframegroup-by

Refactor DataFrame to fill NaNs with data from other column, combine rows


Desire is input DataFrame reduced to name and color columns. If updated_name|color has data, use it. If '' (Nan in true data set) then either use the value entered in updated_name|color column if it exists, and if not use the value in original_name|color column.

In real data set got about as far as the DataFrame below using:

mask = df['updated_name'].notnull() | df['updated_color'].notnull()

I am stuck on next steps.

An example DataFrame is worth a thousand words, minimum viable as follows:

Input DataFrame:

sql_Key original_name original_color updated_name updated_color
1 Happy Bear Red Green Bear
1 Happy Bear Red
1 Happy Bear Red
2 Grumpy Bear Blue
2 Grumpy Bear Blue Purple
2 Grumpy Bear Blue
3 Sleepy Bear Yellow Energy Bear
3 Sleepy Bear Yellow
3 Sleepy Bear Yellow Bright Yellow

Desired Results:

sql_Key name color
1 Green Bear Red
2 Grumpy Bear Purple
3 Energy Bear Bright Yellow

Solution

  • This works, though I feel like it could be simpler: for each field in each group, get the first non-null "updated" value, otherwise get the first non-null "original" value.

    g = df.groupby('sql_Key')
    
    pd.concat(
        [
            g[f'updated_{v}'].first().fillna(g[f'original_{v}'].first()).rename(v)
            for v in ['name', 'color']],
        axis=1)
    
                    name          color
    sql_Key                            
    1         Green Bear            Red
    2        Grumpy Bear         Purple
    3        Energy Bear  Bright Yellow