Search code examples
pythonpandasdataframevalueerrorshift

Rename a column if there is duplicate based on value in another column


I have a dataframe like the following:

df = pd.DataFrame({"Col1": ["AA", "AB", "AA", "CC", "FF"],
                   "Col2": [18, 23, 13, 33, 48],
                   "Col3": [17, 27, 22, 37, 52]})

My goal is if there are duplicated values in Col1, I would then sort (only the duplicate values) by the values in Col2 from smallest to largest, and rename the original "Value" in Col1 to be "Value.A" (for duplicates with smallest value in Col2) "Value.B" (for 2nd smallest, etc). Value of the Col3

Using the example above, this is what I should end up with:

pd.DataFrame({"Col1": ["AA.B", "AB", "AA.A", "CC", "FF"],
                   "Col2": [18, 23, 13, 33, 48],
                   "Col3": [17, 27, 22, 37, 52]})

Since 13<18 so the 2nd "AA" becomes "AA.A" and first "AA" becomes AA.B. (values in Col3 stays unchanged). Also, "AB","CC","FF" all needs to remain unchanged. I could have potentially more than 1 sets of duplicates in Col1.

I do not need to preserve the rows, so long as the values in each row stay the same except the renamed value in Col1. (i.e., I should still have "AA.B", 18, 17 for the 3 columns no matter where the 1st row in the output moves to).

I tried to use the row['Col1'] == df['Col1'].shift() as a lambda function but this gives me the following error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I suspect this was due to the na value when I called shift() but using fillna() doesn't help since that will always create a duplicate at the beginning.

Any suggestions on how I can make it work?


Solution

  • You can use the pandas Groupby.apply with a custom function to get what you want.

    You group the dataframe by the first column and then apply you custom function to each "sub" dataframe. In this case I check if there is a duplicate and if so I use some sorting and ASCII value switching to generate the new labels that you need.

    # your example
    df = pd.DataFrame({"Col1": ["AA", "AB", "AA", "CC", "FF"],
                       "Col2": [18, 23, 13, 33, 48],
                       "Col3": [17, 27, 22, 37, 52]})
    
    def de_dup_labels(x):
        """A function to de-duplicate labels based on column orders"""
        # if at least 1 duplicate exists
        if len(x) > 1:
            # work out what order Col2 would be in if it were sorted
            order = x["Col2"].argsort().values
            
            # update Col1 with the new labels using the order from above
            x["Col1"] = np.asarray([f"{x['Col1'].iloc[i]}.{chr(ord('A') + i)}"
                                    for i in range(len(x))])[order]
        return x
    
    updated_df = df.groupby("Col1").apply(de_dup_labels)