pandas, pandas-groupby, python-dedupe

How do I apply the findings of a Pandas GroupBy to the source data?


I'm de-duplicating names with pandas-dedupe, and the process has several steps.

Firstly I train and de-dupe the source data.

deDupedNames = dedupe_dataframe(sourceData, columnsOfInterest, config_name=configName)

Next, I discard clusters that have only one member.

dedupedComplexSets = dedupe_df_sorted.groupby(['cluster id']).filter(lambda x: len(x) > 1)

Next, I need to examine each group of matches (grouped by 'cluster id') and confirm that the surnames in the group share at least their first three characters. I do this by iterating over dedupedComplexSetsGrouped (dedupedComplexSets grouped by 'cluster id') and grouping each group again by the first three characters of Surname.

for name, group in dedupedComplexSetsGrouped:
    bySurnamePrefix = group.groupby(group.Surname.str[:3]).size()

Finally, I'd like to flag every row that belongs to a de-duped cluster containing more than one distinct three-character Surname prefix.

for name, group in dedupedComplexSetsGrouped:
    bySurnamePrefix = group.groupby(group.Surname.str[:3]).size()

    if len(bySurnamePrefix) > 1:
        dedupedComplexSets[group, 'RowClusterHasLeadingCharacterMismatch'] = True

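However, I can't write the flag back to the original dataframe: the assignment fails with the 'mutable hash' error (a DataFrame is mutable, so it cannot be hashed and used as a key) or with other errors.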

How is a problem like this solved? And how is the output from examination of groups communicated outside of the Grouped Set dataframe? There must be a correct way...?

Example data in (where RowClusterHasLeadingCharacterMismatch is a scripted column):

RowID|FirstName|Surname

12345, fred, surname, false, 
24385, frred, surname, false, 

Example data out: RowID|FirstName|Surname|cluster id|confidence|RowClusterHasLeadingCharacterMismatch

12345, fred, surname, false, 1, .9999995, True
24385, frred, surname, false, 1, .999992, True

Note that I'm using the RowClusterHasLeadingCharacterMismatch as a way of recording the mismatch. Perhaps there is a more effective way to do this?


Solution

  • Answer from Jezrael as seen in comments above:

    Replace

    dedupedComplexSets[group, 'RowClusterHasLeadingCharacterMismatch'] = True

    with

    dedupedComplexSets.loc[group.index, 'RowClusterHasLeadingCharacterMismatch'] = True
    

    My commentary: .loc assigns by row label on dedupedComplexSets itself rather than on the temporary group, so the flags are still present in dedupedComplexSets after the loop and can be persisted to CSV.
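
For reference, here is a minimal sketch of the corrected loop on a small made-up dataframe. The sample rows and the variable names `prefix` and `vectorized` are assumptions for illustration only, not part of the original data or answer. The sketch first sets the flag column to False, since otherwise rows in clusters without a mismatch would be left as NaN. It also shows one possible loop-free alternative that uses `transform("nunique")`; this is a swapped-in technique, not the method from the answer above.

```python
import pandas as pd

# Hypothetical sample standing in for the filtered dedupe output
# (every cluster here already has more than one row).
dedupedComplexSets = pd.DataFrame(
    {
        "RowID": [12345, 24385, 30001, 30002],
        "FirstName": ["fred", "frred", "ann", "anne"],
        "Surname": ["surname", "surname", "smith", "jones"],
        "cluster id": [1, 1, 2, 2],
    }
)

# Start with every row unflagged so the column has no missing values.
dedupedComplexSets["RowClusterHasLeadingCharacterMismatch"] = False

# Loop version with the .loc fix: group.index holds the row labels of the
# current cluster, which .loc uses to write back to the full dataframe.
for name, group in dedupedComplexSets.groupby("cluster id"):
    bySurnamePrefix = group.groupby(group.Surname.str[:3]).size()
    if len(bySurnamePrefix) > 1:
        dedupedComplexSets.loc[group.index, "RowClusterHasLeadingCharacterMismatch"] = True

# Loop-free alternative: count the distinct three-character prefixes per
# cluster and broadcast that count back onto every row of the cluster.
prefix = dedupedComplexSets["Surname"].str[:3]
vectorized = prefix.groupby(dedupedComplexSets["cluster id"]).transform("nunique") > 1
```

Both approaches should flag the same rows. Once the column is filled, the dataframe can be written out with `dedupedComplexSets.to_csv(...)` as usual.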