
Change the values in a column because of repeated rows


My dataframe (a large one which has been simplified for illustration) is of the form:

  |  id  | duplicate_id | Country | action_code | keep
  |------|--------------|---------|-------------|-----
  |  1   |              |  Norway |   1.1       |  y
  |  2   |              |  Norway |   1.1       |  y
  |  3   |              |  Norway |   1.2       |  y
  |  4   |              | Iceland |   2.0       |  y
  |  5   |              | Iceland |   2.0       |  y
  |  6   |              | Iceland |   2.0       |  y

Instead of removing the duplicate rows, I want to keep the original dataframe but populate the 'duplicate_id' column of each non-unique row with the id of the first matching row (i.e. the id from the first column). Also, for those non-unique rows, 'action_code' should change to '12' and 'keep' should change to 'n'. The final table is expected to read:

  |  id  | duplicate_id | Country | action_code | keep 
  |------|--------------|---------|-------------|-----
  |  1   |              |  Norway |   1.1       |  y
  |  2   |     1        |  Norway |   12        |  n
  |  3   |              |  Norway |   1.2       |  y
  |  4   |              | Iceland |   2.0       |  y
  |  5   |     4        | Iceland |   12        |  n
  |  6   |     4        | Iceland |   12        |  n

Can this be done using numpy's `where` function, or is there a more direct approach?


Solution

  • A groupby with a custom summarize function will do the job nicely:

    def summarize(group):
        # Every row after the first in `group` is a
        # duplicate of the group's first row
        cond = group.index != group.index[0]
        group.loc[cond, 'duplicate_id'] = group.iloc[0]['id']
        group.loc[cond, 'action_code'] = '12'
        group.loc[cond, 'keep'] = 'n'
        return group

    # group_keys=False keeps the original index instead of
    # prepending the (Country, action_code) group keys
    df = df.groupby(['Country', 'action_code'], group_keys=False).apply(summarize)
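
  • There is also a more direct, fully vectorized route that avoids a Python-level apply: mark the repeated rows with `DataFrame.duplicated` and look up each group's first id with a groupby `transform`. A minimal sketch using the sample data from the question (column names and values taken from there; `action_code` is kept as strings here so that assigning '12' does not fight a float dtype):

    ```python
    import pandas as pd

    df = pd.DataFrame({
        'id': [1, 2, 3, 4, 5, 6],
        'duplicate_id': [''] * 6,
        'Country': ['Norway', 'Norway', 'Norway',
                    'Iceland', 'Iceland', 'Iceland'],
        'action_code': ['1.1', '1.1', '1.2', '2.0', '2.0', '2.0'],
        'keep': ['y'] * 6,
    })

    # True for every row that repeats an earlier
    # (Country, action_code) combination
    dup = df.duplicated(['Country', 'action_code'])

    # id of the first row in each (Country, action_code) group,
    # aligned to the original index
    first_id = df.groupby(['Country', 'action_code'])['id'].transform('first')

    df.loc[dup, 'duplicate_id'] = first_id[dup]
    df.loc[dup, 'action_code'] = '12'
    df.loc[dup, 'keep'] = 'n'
    ```

    Note that `dup` and `first_id` are computed before `action_code` is overwritten, so the grouping still sees the original codes.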