Desire is input DataFrame
reduced to name
and color
columns. If updated_name|color
has data, use it. If ''
(Nan
in true data set) then either use the value entered in updated_name|color
column if it exists, and if not use the value in original_name|color
column.
In real data set got about as far as the DataFrame
below using:
mask = df['updated_name'].notnull() | df['updated_color'].notnull()
I am stuck on next steps.
An example DataFrame
is worth a thousand words, minimum viable as follows:
Input DataFrame
:
sql_Key | original_name | original_color | updated_name | updated_color |
---|---|---|---|---|
1 | Happy Bear | Red | Green Bear | |
1 | Happy Bear | Red | ||
1 | Happy Bear | Red | ||
2 | Grumpy Bear | Blue | ||
2 | Grumpy Bear | Blue | Purple | |
2 | Grumpy Bear | Blue | ||
3 | Sleepy Bear | Yellow | Energy Bear | |
3 | Sleepy Bear | Yellow | ||
3 | Sleepy Bear | Yellow | Bright Yellow |
Desired Results:
sql_Key | name | color |
---|---|---|
1 | Green Bear | Red |
2 | Grumpy Bear | Purple |
3 | Energy Bear | Bright Yellow |
This works, though I feel like it could be simpler: for each field in each group, get the first non-null "updated" value, otherwise get the first non-null "original" value.
g = df.groupby('sql_Key')
pd.concat(
[
g[f'updated_{v}'].first().fillna(g[f'original_{v}'].first()).rename(v)
for v in ['name', 'color']],
axis=1)
name color
sql_Key
1 Green Bear Red
2 Grumpy Bear Purple
3 Energy Bear Bright Yellow