Search code examples
pythonpandasgroup-by

Panda rename rows after grouping by columns


I've recently started to play around with Pandas in order to manipulate some data and I am now trying to anonymize a few columns after a groupBy to find unique occurrences for persons.

For example, suppose the following DF:

   First Name Last Name         DOB
0  Bob        One               28/05/1973
1  Bob        One               28/05/1973
2  Ana        Two               28/07/1991
3  Ana        Two               28/07/1991
4  Ana        Two               28/07/1991
5  Jim        Three             07/01/1994

I can easily find unique person by First Name, Last Name and DOB by using df.groupby(['First Name', 'Last Name', 'DOB']).

However, I'd like to apply a function to every unique combination that would transform those names to a known anonymized (incremental) version.

   First Name Last Name         DOB
0  F1         L1                28/05/1973
1  F1         L1                28/05/1973
2  F2         L2                28/07/1991
3  F2         L2                28/07/1991
4  F2         L2                28/07/1991
5  F3         L3                07/01/1994

I've tried a few things with transform and apply functions of DF groupBy but with no lucky so far. How could I achieve this?


Solution

  • ids = (df.groupby(["FirstName", "LastName", "DOB"], sort=False)
             .ngroup().add(1)
             .astype(str))
    
    df["FirstName"] = "F" + ids
    df["LastName"]  = "L" + ids
    
    • identify the IDs of firstname, lastname and DOB triples over their group number (ngroup)

      • sort=False helps it retain the seen order
      • ngroup is 0-based, so we add(1)
    • then add prefixes "F" and "L" to the IDs and assign to appropriate columns

    to get

    >>> df
      FirstName LastName         DOB
    0        F1       L1  28/05/1973
    1        F1       L1  28/05/1973
    2        F2       L2  28/07/1991
    3        F2       L2  28/07/1991
    4        F2       L2  28/07/1991
    5        F3       L3  07/01/1994
    

    where the ids was

    >>> ids
    0    1
    1    1
    2    2
    3    2
    4    2
    5    3
    dtype: object