I've recently started to play around with Pandas in order to manipulate some data and I am now trying to anonymize a few columns after a groupBy
to find unique occurrences for persons.
For example, suppose the following DF:
First Name Last Name DOB
0 Bob One 28/05/1973
1 Bob One 28/05/1973
2 Ana Two 28/07/1991
3 Ana Two 28/07/1991
4 Ana Two 28/07/1991
5 Jim Three 07/01/1994
I can easily find unique person by First Name, Last Name and DOB by using df.groupby(['First Name', 'Last Name', 'DOB'])
.
However, I'd like to apply a function to every unique combination that would transform those names to a known anonymized (incremental) version.
First Name Last Name DOB
0 F1 L1 28/05/1973
1 F1 L1 28/05/1973
2 F2 L2 28/07/1991
3 F2 L2 28/07/1991
4 F2 L2 28/07/1991
5 F3 L3 07/01/1994
I've tried a few things with transform
and apply
functions of DF groupBy but with no lucky so far. How could I achieve this?
ids = (df.groupby(["FirstName", "LastName", "DOB"], sort=False)
.ngroup().add(1)
.astype(str))
df["FirstName"] = "F" + ids
df["LastName"] = "L" + ids
identify the IDs of firstname, lastname and DOB triples over their group number (ngroup)
then add prefixes "F" and "L" to the IDs and assign to appropriate columns
to get
>>> df
FirstName LastName DOB
0 F1 L1 28/05/1973
1 F1 L1 28/05/1973
2 F2 L2 28/07/1991
3 F2 L2 28/07/1991
4 F2 L2 28/07/1991
5 F3 L3 07/01/1994
where the ids
was
>>> ids
0 1
1 1
2 2
3 2
4 2
5 3
dtype: object