Search code examples
pythonpandasgroup-bypandas-groupbyunique

% of Unique Values by Group


I want to create the below column Stories_% that categorizes unique value %'s.

    SubID   Stories   Stories_%
0   102F2   2         1 (20%), 2 (80%)
1   102F2   2         1 (20%), 2 (80%)
2   102F2   2         1 (20%), 2 (80%)
3   102F2   2         1 (20%), 2 (80%)
4   102F2   1         1 (20%), 2 (80%)
5   132F2   2         1 (60%), 2 (40%)
6   132F2   1         1 (60%), 2 (40%)
7   132F2   1         1 (60%), 2 (40%)
8   132F2   2         1 (60%), 2 (40%)
9   132F2   1         1 (60%), 2 (40%)

I would drop the duplicates and drop Stories so the final table would look like this:

    SubID   Stories_%
0   102F2   1 (20%), 2 (80%)
1   132F2   1 (60%), 2 (40%)

I started by creating a list with all of the unique values (below):

unique_stories = df.groupby(["SubID"])['Stories'].unique().astype(int)
unique_stories
SubID
001C5        ['2']
001C6        ['2']
002I2        ['2']
004C6        ['2']
005L2        ['2']
           ...    
709E1    ['1' '2']
725E1    ['2' '1']
730E1    ['2' '1']

The issue I'm having is getting the %s for each unique count.

I've tried...

df.groupby(['SubID'])['Stories'].count()

...but this only creates a total count rather than a count of each unique value within that category.

I've researched a number of posts but can only turn up total counts, nunique counts, and count of unique values over the entire column, not by group etc.

Is there a method for doing this?


Solution

  • Method 1: You can use groupby + value_counts:

    s = df.groupby('SubID')['Stories']\
          .value_counts(normalize=True, sort=False).map('{:.0%}'.format)
    
    (s.index.get_level_values(1).astype(str) + ' (' + s + ')')\
       .groupby(level=0).agg(','.join).reset_index(name='Stories %')
    

    Method 2: Or you can use Counter from collections:

    def normalize():
        d = {}
        for i, g in df.groupby('SubID'):
            c = Counter(g['Stories'])
            d[i] = ', '.join(f'{k} ({v / sum(c.values()):.0%})'
                             for k, v in c.items())
        return d
    
    pd.Series(normalize(), name='Stories %').rename('SubID').reset_index()
    

       SubID        Stories %
    0  102F2  1 (20%),2 (80%)
    1  132F2  1 (60%),2 (40%)