I have a pandas dataframe, df:
Team Name Role Country
Mobile Developer USA
Mobile Developer USA
Mobile Developer USA
Mobile Developer USA
Mobile Product Owner USA
Mobile Product Owner USA
Mobile Product Owner UK
Mobile Scrum Master India
Mobile Scrum Master India
Mobile Developer UK
Mobile Developer UK
Mobile Developer UKtype here
Web Developer UK
Web Developer UK
Web Developer UK
Web Developer UK
Web Product Owner India
Web Product Owner India
Web Product Owner UK
Web Scrum Master USA
Web Scrum Master USA
Web Developer USA
Web Developer USA
Web Developer USA
I want to transform it using Pandas or any suitable alternative. For a given Team (Mobile/Web), I want to derive the 'most occuring country' value by the Team member 'Role'. Results Dataframe can be of any shape - Option 1 or Option 2.
I understand there will be instance swhen there will be a tie in 'most occuring country, I will treat that as a edge case for now, I will select any random first value (50 % USA, 50 UK) in case of a tie.
Result Dataframe Option 1
Team Name Developer Product Owner Scrum Master
Mobile USA USA India
Web UK India USA
Result Dataframe Option 2
Team Name Role Country (Most occuring value)
Mobile Developer USA
Mobile Product Owner USA
Mobile Scrum Master India
Web Developer UK
Web Product Owner India
Web Scrum Master USA
I have tried grouping the dataframe, however it did not give me the desired outcome.
A simple way to get the highest occurring value is to use pd.Series.mode in a groupby or pivot table:
df.groupby(["Team Name", "Role"])["Country"].agg(pd.Series.mode).unstack()
# or
df.pivot_table(index="Team Name", columns="Role", values="Country", aggfunc={'Country': pd.Series.mode})
#Role Developer Product Owner Scrum Master
#Team Name
#Mobile USA USA India
#Web UK India USA
To take the random country if there are multiple countries that are most frequent, you could apply a function to a groupby:
def func(x):
val_counts = x.value_counts().sort_values(ascending=False)
if sum(val_counts == val_counts.max()) > 1:
return val_counts[val_counts == val_counts.max()].sample(1).index[0]
else:
return val_counts.index[0]
df.groupby(["Team Name", "Role"])["Country"].apply(func).unstack()
# or
df.pivot_table(index="Team Name", columns="Role", values="Country", aggfunc={"Country": func})
#Role Developer Product Owner Scrum Master
#Team Name
#Mobile USA USA India
#Web UK India USA
Testing for speed, using pivot_table
is slower than groupby
, and using the conditional if
statement within the function is faster than taking a random sample of the most frequent countries even if there is only 1 (as is the case for the data you have provided above).
To return output option 2, do not use the .unstack()
at the end of the groupby code (or add .stack()
to the pivot_table, which will therefore worsen comparative performance).