
Change the values in a column because of repeated rows


My dataframe (a large one which has been simplified for illustration) is of the form:

  |  id  | duplicate_id | Country | action_code | keep
  |------|--------------|---------|-------------|-----
  |  1   |              |  Norway |   1.1       |  y
  |  2   |              |  Norway |   1.1       |  y
  |  3   |              |  Norway |   1.2       |  y
  |  4   |              | Iceland |   2.0       |  y
  |  5   |              | Iceland |   2.0       |  y
  |  6   |              | Iceland |   2.0       |  y

Instead of removing the duplicate rows, I want to keep the original dataframe but populate the 'duplicate_id' column of each non-unique row with the id of the first matching row (i.e. the id from the first column). Also, for those non-unique rows, 'action_code' should change to '12' and 'keep' should change to 'n'. The final table is expected to read:

  |  id  | duplicate_id | Country | action_code | keep 
  |------|--------------|---------|-------------|-----
  |  1   |              |  Norway |   1.1       |  y
  |  2   |     1        |  Norway |   12        |  n
  |  3   |              |  Norway |   1.2       |  y
  |  4   |              | Iceland |   2.0       |  y
  |  5   |     4        | Iceland |   12        |  n
  |  6   |     4        | Iceland |   12        |  n

Can this be done using numpy's `where` function, or is there a more direct approach?


Solution

  • A groupby with a custom summarize function will do the job nicely:

    def summarize(group):
        # Every row after the first in `group` is a
        # duplicate of the group's first row
        cond = group.index != group.index[0]
        group.loc[cond, 'duplicate_id'] = group.iloc[0]['id']
        group.loc[cond, 'action_code'] = '12'
        group.loc[cond, 'keep'] = 'n'
        return group

    # group_keys=False keeps the original index instead of
    # prepending the (Country, action_code) group keys
    df = df.groupby(['Country', 'action_code'], group_keys=False).apply(summarize)
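
  • There is also a more direct, fully vectorized route that avoids a Python-level apply: mark the repeated rows with `DataFrame.duplicated` and look up each group's first id with a groupby `transform`. A minimal sketch using the sample data from the question (column names and values taken from there; `action_code` is kept as strings here so that assigning '12' does not fight a float dtype):

    ```python
    import pandas as pd

    df = pd.DataFrame({
        'id': [1, 2, 3, 4, 5, 6],
        'duplicate_id': [''] * 6,
        'Country': ['Norway', 'Norway', 'Norway',
                    'Iceland', 'Iceland', 'Iceland'],
        'action_code': ['1.1', '1.1', '1.2', '2.0', '2.0', '2.0'],
        'keep': ['y'] * 6,
    })

    # True for every row that repeats an earlier
    # (Country, action_code) combination
    dup = df.duplicated(['Country', 'action_code'])

    # id of the first row in each (Country, action_code) group,
    # aligned to the original index
    first_id = df.groupby(['Country', 'action_code'])['id'].transform('first')

    df.loc[dup, 'duplicate_id'] = first_id[dup]
    df.loc[dup, 'action_code'] = '12'
    df.loc[dup, 'keep'] = 'n'
    ```

    Note that `dup` and `first_id` are computed before `action_code` is overwritten, so the grouping still sees the original codes.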