Search code examples
pythonpandasdataframegroup-byunique

How to group by unique values pandas groupby


Before I ask my question, I want it to be known that I looked at the following page but it did not return what I need specifically:

Count unique values using pandas groupby

Let's say I have the following df of four individuals trying to guess a code. Each individual has two guesses:

df = pd.DataFrame({'name':['Sally', 'John', 'Lucy', 'Mark','Sally', 'John', 'Lucy', 'Mark'], 
                   'guess':['123', '321', '213', '312', '321', '231', '123', '312']})

df

    name    guess
0   Sally   123
1   John    321
2   Lucy    213
3   Mark    312
4   Sally   321
5   John    231
6   Lucy    123
7   Mark    312

I want to know how many completely unique guesses each individual has. That is, I don't want to know how many unique guesses each individual has out of their own guesses, rather, I want to know how many unique guesses they have out of all guesses. Let me elaborate.

Using the code from the post linked above, this is what I get:

df.groupby('name')[['guess']].nunique()


      guess
name    
John    2
Lucy    2
Mark    1
Sally   2

This returns how many unique guesses each individual has when compared to their own guesses. Again, what I am looking for is how many unique guesses each individual has out of all total guesses (aka the entire coulmn). This is the output I am looking for:

      guess count
name    
John    1     2
Lucy    1     2
Mark    0     2
Sally   0     2

Because one of John's guesses (231) and one of Lucy's guesses (213) are unique out of all guesses. It would also be nice to have a column showing each individuals total guess count.

Thank you in advance!


Solution

  • You can first find out which guesses were unique by grouping by guess, then just doing a grouped count and sum on name afterwards gives you the final output:

    In [64]: df['unique'] = df['guess'].map(df.groupby("guess").count()['name'] == 1).astype(int)
    
    In [65]: df.groupby("name")['unique'].agg(['sum', 'count']).rename(columns={'sum': 'guess'})
    Out[65]:
           guess  count
    name
    John     1      2
    Lucy     1      2
    Mark     0      2
    Sally    0      2