Search code examples
pandasdataframeunique-values

Pandas count unique values for list of values


I am trying to figure out, how to count unique values in some column depending on another value. My DataFrame looks like this:

   id_user  id_track  
       1        1           
       1        2 
       1        4
       3        1
       3        1    
       3        4
       1        1
       2        5

Basically I have a table with id's of users and songs they listened to. I want to count for each user, how many unique songs did he listen and to sort it by this value. The output should be like this:

id_user uniqueTracks
   1         3
   3         2
   2         1

I tried to do this in this way (hearings is my DataFrame):

uniqueTracks=[]  #list of numbers of unique tracks
for i in range(len(hearings['id_user'].unique())):
     uniqueTracks.append(len(hearings[hearings['id_user']==i['titles'].unique()))           

but it works terribly slow for table of 27 million rows and 70k unique users. Does anyone have a clue how to do this in Pandas? Thank you in advance:)


Solution

  • Here's a NumPy approach meant for performance -

    def nunique_groupby_col0_in_col1(a):
        b = a[np.lexsort(a[:,::-1].T)]
    
        m = np.r_[True, b[1:,1] != b[:-1,1]]
        split_idx = np.r_[0, np.flatnonzero(b[1:,0] != b[:-1,0])+1]
        m[split_idx] = 1
        count = np.add.reduceat(m,split_idx)
        userIDs = b[split_idx,0]
    
        sidx = count.argsort()[::-1]
        out_data = np.column_stack(( userIDs, count ))[sidx]
        return out_data
    

    Sample run -

    In [69]: df
    Out[69]: 
       id_user  id_track
    0        1         1
    1        1         2
    2        1         4
    3        3         1
    4        3         1
    5        3         4
    6        1         1
    7        2         5
    
    In [70]: out_data = nunique_groupby_col0_in_col1(df.values)
        ...: cnames = list(['id_user','uniqueTracks'])
        ...: dfout = pd.DataFrame(out_data,columns=cnames)
        ...: 
    
    In [71]: dfout
    Out[71]: 
       id_user  uniqueTracks
    0        1             3
    1        3             2
    2        2             1