Tags: python-3.x, pandas, pandas-groupby, fuzzywuzzy

How do I get additional column name information in a pandas group by / nlargest calculation?


I am comparing pairs of strings using six fuzzywuzzy ratios, and I need to output the top three scores for each pair.

This line does the job:

final2_df = final_df[['nameHiringOrganization', 'mesure', 'name', 'valeur']].groupby(['nameHiringOrganization', 'name'])['valeur'].nlargest(3)

However, the Excel output table lacks the 'mesure' column, which contains the ratio's name. This is a problem, because I can't tell which of the six ratios works best for any given pair.

I thought selecting the columns at the beginning might work (final_df[['columns', ...]]), but it doesn't seem to.
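
A minimal sketch of why the column disappears, using invented toy data (only the column names come from the line above; the organizations and scores are hypothetical). Selecting ['valeur'] after the groupby means nlargest runs on that single column, so the result is a Series of 'valeur' values and the initial column selection has no effect on it:

```python
import pandas as pd

# Hypothetical toy data: two pairs, four ratios each (values are invented).
final_df = pd.DataFrame({
    "nameHiringOrganization": ["Acme"] * 4 + ["Globex"] * 4,
    "name": ["ACME Inc"] * 4 + ["Globex Corp"] * 4,
    "mesure": ["ratio", "partial_ratio", "token_sort_ratio", "token_set_ratio"] * 2,
    "valeur": [80, 95, 70, 90, 60, 55, 88, 72],
})

top3 = (final_df[["nameHiringOrganization", "mesure", "name", "valeur"]]
        .groupby(["nameHiringOrganization", "name"])["valeur"]
        .nlargest(3))

# The result is a Series named 'valeur'; 'mesure' is neither a value
# nor an index level, so it never reaches the Excel export.
print(type(top3).__name__)
print(top3.name)
print(top3.index.names)
```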

Any thoughts on how I might add that info?

Many thanks in advance!


Solution

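  • I think a different approach works here: sort by the three columns with DataFrame.sort_values (descending on 'valeur'), then take the first three rows of each group with GroupBy.head. Because head returns whole rows, the 'mesure' column is kept: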

    final2_df = (final_df.sort_values(['nameHiringOrganization', 'name', 'valeur'], 
                                       ascending=[True, True, False])
                         .groupby(['nameHiringOrganization', 'name'])
                         .head(3))
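
To illustrate the approach above, here is a small self-contained sketch on invented toy data (the column names follow the question; the organizations and scores are hypothetical). It suggests that every original column, including 'mesure', survives in the result:

```python
import pandas as pd

# Hypothetical toy data: two pairs, four ratios each (values are invented).
final_df = pd.DataFrame({
    "nameHiringOrganization": ["Acme"] * 4 + ["Globex"] * 4,
    "name": ["ACME Inc"] * 4 + ["Globex Corp"] * 4,
    "mesure": ["ratio", "partial_ratio", "token_sort_ratio", "token_set_ratio"] * 2,
    "valeur": [80, 95, 70, 90, 60, 55, 88, 72],
})

# Sort so the highest 'valeur' comes first within each pair,
# then keep the first three rows of every group.
final2_df = (final_df.sort_values(["nameHiringOrganization", "name", "valeur"],
                                  ascending=[True, True, False])
                     .groupby(["nameHiringOrganization", "name"])
                     .head(3))

print(final2_df)
```

The result is an ordinary DataFrame with the original row labels, so it can be written out with to_excel as before; pass index=False there if the row labels are not wanted.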