Pandas difference between groupby-size and unique


The goal here is to see how many unique values I have in my database. This is the code I have written:

import pandas as pd

# Read only the appid column from the tab-separated file
apps = pd.read_csv('ConcatOwned1_900.csv', sep='\t', usecols=['appid'])

apps['appid'] = apps['appid'].astype(int)
apps_list = apps['appid'].unique()

b = apps.groupby('appid').size()
blist = b.unique()

print(len(apps_list), len(blist), len(set(b)))
>>>7672 2164 2164

Why is there a difference between those two methods?

By request, I am posting some of my data:

Unnamed: 0  StudID          No  appid work work2
0   0   76561193665298433   0   10  nan 0
1   1   76561193665298433   1   20  nan 0
2   2   76561193665298433   2   30  nan 0
3   3   76561193665298433   3   40  nan 0
4   4   76561193665298433   4   50  nan 0
5   5   76561193665298433   5   60  nan 0
6   6   76561193665298433   6   70  nan 0
7   7   76561193665298433   7   80  nan 0
8   8   76561193665298433   8   100 nan 0
9   9   76561193665298433   9   130 nan 0
10  10  76561193665298433   10  220 nan 0
11  11  76561193665298433   11  240 nan 0
12  12  76561193665298433   12  280 nan 0
13  13  76561193665298433   13  300 nan 0
14  14  76561193665298433   14  320 nan 0
15  15  76561193665298433   15  340 nan 0
16  16  76561193665298433   16  360 nan 0
17  17  76561193665298433   17  380 nan 0
18  18  76561193665298433   18  400 nan 0
19  19  76561193665298433   19  420 nan 0
20  20  76561193665298433   20  500 nan 0
21  21  76561193665298433   21  550 nan 0
22  22  76561193665298433   22  620 6.0 3064
33  33  76561193665298434   0   10  nan 837
34  34  76561193665298434   1   20  nan 27
35  35  76561193665298434   2   30  nan 9
36  36  76561193665298434   3   40  nan 5
37  37  76561193665298434   4   50  nan 2
38  38  76561193665298434   5   60  nan 0
39  39  76561193665298434   6   70  nan 403
40  40  76561193665298434   7   130 nan 0
41  41  76561193665298434   8   80  nan 6
42  42  76561193665298434   9   100 nan 10
43  43  76561193665298434   10  220 nan 14

Solution

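  • If I understand correctly, based on the posted piece of the DataFrame, you should be analyzing b.index, not the values of b. groupby('appid').size() returns a Series whose index holds the distinct appid values and whose values are the number of rows for each appid, so b.unique() counts distinct group sizes rather than distinct appids. Just look: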

    b = apps.groupby('appid').size()
    
    In [24]: b  
    Out[24]:    
    appid       
    10     2    
    20     2    
    30     2    
    40     2    
    50     2    
    60     2    
    70     2    
    80     2    
    100    2    
    130    2    
    220    2    
    240    1    
    280    1    
    300    1    
    320    1    
    340    1    
    360    1    
    380    1    
    400    1    
    420    1    
    500    1    
    550    1    
    620    1    
    dtype: int64
    
    In [25]: set(b)
    Out[25]: {1, 2}
    

    But if you do it for b.index you'll get the same values for all 3 methods:

    blist = b.index.unique()
    
    In [30]: len(apps_list), len(blist), len(set(b.index))
    Out[30]: (23, 23, 23)
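
The same point can be reproduced on a tiny frame. This is only a sketch: the five-row `apps` DataFrame below is made-up sample data, not the asker's file, and the variable names are chosen for illustration. It also shows `Series.nunique()`, which counts distinct values directly without going through `groupby`.

```python
import pandas as pd

# Hypothetical sample data: appid 10 and 20 appear twice, 30 appears once.
apps = pd.DataFrame({"appid": [10, 20, 30, 10, 20]})

# Index: the distinct appid values. Values: number of rows per appid.
sizes = apps.groupby("appid").size()

n_distinct_appids = len(sizes.index)          # distinct appids (the group keys)
n_distinct_group_sizes = len(sizes.unique())  # distinct row counts, here {1, 2}
n_unique_direct = apps["appid"].nunique()     # distinct appids, no groupby needed

print(n_distinct_appids, n_distinct_group_sizes, n_unique_direct)
# prints: 3 2 3
```

If only the count of distinct appids is needed, `nunique()` is probably the simplest option; `groupby(...).size()` is more useful when the per-appid row counts themselves matter.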