Search code examples
pythonpandasdataframegroup-by

How to identify the 'most occurring text value' in a particular column in a pandas dataframe


I have a pandas dataframe, df:

Team Name   Role                Country
Mobile      Developer           USA
Mobile      Developer           USA
Mobile      Developer           USA
Mobile      Developer           USA
Mobile      Product Owner       USA
Mobile      Product Owner       USA
Mobile      Product Owner       UK
Mobile      Scrum Master        India
Mobile      Scrum Master        India
Mobile      Developer           UK
Mobile      Developer           UK
Mobile      Developer           UKtype here

Web     Developer               UK
Web     Developer               UK
Web     Developer               UK
Web     Developer               UK
Web     Product Owner           India
Web     Product Owner           India
Web     Product Owner           UK
Web     Scrum Master            USA
Web     Scrum Master            USA
Web     Developer               USA
Web     Developer               USA
Web     Developer               USA

I want to transform it using Pandas or any suitable alternative. For a given Team (Mobile/Web), I want to derive the 'most occuring country' value by the Team member 'Role'. Results Dataframe can be of any shape - Option 1 or Option 2.

I understand there will be instance swhen there will be a tie in 'most occuring country, I will treat that as a edge case for now, I will select any random first value (50 % USA, 50 UK) in case of a tie.

Result Dataframe Option 1

Team Name   Developer       Product Owner       Scrum Master
Mobile      USA             USA                 India
Web         UK              India               USA

Result Dataframe Option 2


Team Name   Role            Country (Most occuring value)
Mobile      Developer       USA
Mobile      Product Owner           USA
Mobile      Scrum Master            India 
Web     Developer       UK
Web     Product Owner           India
Web     Scrum Master            USA

I have tried grouping the dataframe, however it did not give me the desired outcome.


Solution

  • A simple way to get the highest occurring value is to use pd.Series.mode in a groupby or pivot table:

    df.groupby(["Team Name", "Role"])["Country"].agg(pd.Series.mode).unstack()
    # or
    df.pivot_table(index="Team Name", columns="Role", values="Country", aggfunc={'Country': pd.Series.mode})
    
    #Role      Developer Product Owner Scrum Master
    #Team Name                                     
    #Mobile          USA           USA        India
    #Web              UK         India          USA
    

    To take the random country if there are multiple countries that are most frequent, you could apply a function to a groupby:

    def func(x):
        val_counts = x.value_counts().sort_values(ascending=False)
        if sum(val_counts == val_counts.max()) > 1:
            return val_counts[val_counts == val_counts.max()].sample(1).index[0]
        else:
            return val_counts.index[0]
    df.groupby(["Team Name", "Role"])["Country"].apply(func).unstack()
    # or
    df.pivot_table(index="Team Name", columns="Role", values="Country", aggfunc={"Country": func})
    
    #Role      Developer Product Owner Scrum Master
    #Team Name                                     
    #Mobile          USA           USA        India
    #Web              UK         India          USA
    

    Testing for speed, using pivot_table is slower than groupby, and using the conditional if statement within the function is faster than taking a random sample of the most frequent countries even if there is only 1 (as is the case for the data you have provided above).

    To return output option 2, do not use the .unstack() at the end of the groupby code (or add .stack() to the pivot_table, which will therefore worsen comparative performance).