I have a question about grouping pandas DataFrames by multiple columns. I am looking at some data for a TV show and trying to ensure that no season has two contestants with the same name.
Series | Name |
---|---|
1 | David |
1 | Edward |
1 | Jasmine |
2 | Lea |
2 | Jonathan |
2 | Louise |
I want a unique count for groupings of Series + Name, which works well when the Series contains a numeric data type. I can do:
df.groupby(['Series','Name'])['Name'].count()
and get
Series | Name | Count |
---|---|---|
1 | David | 1 |
1 | Edward | 1 |
1 | Jasmine | 1 |
2 | Lea | 1 |
2 | Jonathan | 1 |
2 | Louise | 1 |
However, if series is set to a categorical data type then
df.groupby(['Series','Name'])['Name'].count()
returns the following table
Series | Name | Count |
---|---|---|
1 | David | 1 |
2 | David | 0 |
1 | Edward | 1 |
2 | Edward | 0 |
1 | Jasmine | 1 |
2 | Jasmine | 0 |
1 | Jonathan | 0 |
2 | Jonathan | 1 |
1 | Lea | 0 |
2 | Lea | 1 |
1 | Louise | 0 |
2 | Louise | 1 |
Panda groups every possible combination of series and names and then sorts alphanumerically. I don't understand why. Any help would be most appreciated.
Actually, this is the default behaviour of Pandas when grouping categorical value, it adds missing categories (checkout this thread). To group only the observed categories on the dataframe you can use:
df.groupby(['Series','Name'],observed=True)['Name'].count()