I am working with a baby names data set from: https://raw.githubusercontent.com/hadley/data-baby-names/master/baby-names.csv
Which has data that looks like:
"year","name","percent","sex"
1880,"John",0.081541,"boy"
1880,"William",0.080511,"boy"
1880,"James",0.050057,"boy"
1880,"Charles",0.045167,"boy"
1880,"George",0.043292,"boy"
1880,"Frank",0.02738,"boy"
1880,"Joseph",0.022229,"boy"
I have grouped all the names together and summed the percentages for boy and girl with:
data1.groupby(['name','sex'])[['percent']].sum()
Which creates a multi-indexed data frame:
Name Sex Percent
Aaron boy 0.292292
girl 0.000805
Abagail girl 0.001326
Abbie boy 0.000092
girl 0.022804
For each name, I want to return the higher percentage sex in a new data frame that would like:
Name Sex Percent
Aaron boy 0.292292
Abagail girl 0.001326
Abbie girl 0.022804
I have been looking through the multi-index documentation but have not been able to figure this out. Any help is appreciated.
You can using a groupby
tail
after the groupby
sum
s=df.groupby(['name','sex'])[['percent']].sum()
s.sort_values('Percent').groupby(level=0).tail(1)