My dataframe (a large one which has been simplified for illustration) is of the form:
| id | duplicate_id | Country | action_code | keep
|------|--------------|---------|-------------|-----
| 1 | | Norway | 1.1 | y
| 2 | | Norway | 1.1 | y
| 3 | | Norway | 1.2 | y
| 4 | | Iceland | 2.0 | y
| 5 | | Iceland | 2.0 | y
| 6 | | Iceland | 2.0 | y
and instead of removing the duplicate rows I want to keep the original dataframe almost intact, but populate the 'duplicate_id' column of each non-unique row with the id of the first (unique) row in its group, i.e. the id from the first column. For those non-unique rows, 'action_code' should also change to '12' and 'keep' to 'n'. The final table is expected to read:
| id | duplicate_id | Country | action_code | keep
|------|--------------|---------|-------------|-----
| 1 | | Norway | 1.1 | y
| 2 | 1 | Norway | 12 | n
| 3 | | Norway | 1.2 | y
| 4 | | Iceland | 2.0 | y
| 5 | 4 | Iceland | 12 | n
| 6 | 4 | Iceland | 12 | n
Can this be done using numpy's 'where' function, or is there a more direct approach?
A groupby with a custom summarize function will do the job nicely:
```python
def summarize(group):
    # Every row in the `group` dataframe except the
    # first one is a duplicate of that first row
    cond = group.index != group.index[0]
    group.loc[cond, 'duplicate_id'] = group.iloc[0]['id']
    group.loc[cond, 'action_code'] = '12'
    group.loc[cond, 'keep'] = 'n'
    return group

df = df.groupby(['Country', 'action_code'], group_keys=False).apply(summarize)
```

Note that `apply` returns a new dataframe, so the result has to be assigned back; `group_keys=False` keeps the grouping keys from being prepended to the index of the result.
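As for the "more direct" part of the question: the same result can be obtained without `apply`, using `duplicated` to flag the non-unique rows and `transform('first')` to fetch each group's first id. A sketch, assuming (as in the groupby above) that duplicates are defined by the `Country` and `action_code` columns:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'duplicate_id': [pd.NA] * 6,
    'Country': ['Norway', 'Norway', 'Norway', 'Iceland', 'Iceland', 'Iceland'],
    'action_code': ['1.1', '1.1', '1.2', '2.0', '2.0', '2.0'],
    'keep': ['y'] * 6,
})

keys = ['Country', 'action_code']
# True for every row after the first occurrence of its key combination
dup = df.duplicated(subset=keys, keep='first')
# id of the first row in each group, broadcast back to every row
first_id = df.groupby(keys)['id'].transform('first')

# Compute the masks before mutating, since 'action_code' is itself a key
df.loc[dup, 'duplicate_id'] = first_id[dup]
df.loc[dup, 'action_code'] = '12'
df.loc[dup, 'keep'] = 'n'
```

This is fully vectorized, so it should also scale better to the large dataframe mentioned at the start.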