
When is it appropriate to use df.value_counts() vs df.groupby('...').count()?


I've heard that in pandas there are often multiple ways to do the same thing, but I was wondering:

If I'm trying to group data by a value within a specific column and count the number of items with that value, when does it make sense to use df.groupby('colA').count() and when does it make sense to use df['colA'].value_counts()?


Solution

  • There is a difference in ordering. According to the documentation, value_counts returns:

    The resulting object will be in descending order so that the first element is the most frequently-occurring element.

    count does not do this; its output is sorted by the index (the group keys created from the column passed to groupby('col')).


    df.groupby('colA').count() 
    

    aggregates all remaining columns of df with the count function, so for each group it counts the values in every column, excluding NaNs.

    So if you need the count for only one column, use:

    df.groupby('colA')['colA'].count() 
    

    Sample:

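    import numpy as np
    import pandas as pd

    # colA is listed first so the column order matches the printed frame below
    df = pd.DataFrame({'colA':['c','c','b','a',np.nan,'b','b'],
                       'colB':list('abcdefg'),
                       'colC':[1,3,5,7,np.nan,np.nan,4],
                       'colD':[np.nan,3,6,9,2,4,np.nan]})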
    
    print (df)
      colA colB  colC  colD
    0    c    a   1.0   NaN
    1    c    b   3.0   3.0
    2    b    c   5.0   6.0
    3    a    d   7.0   9.0
    4  NaN    e   NaN   2.0
    5    b    f   NaN   4.0
    6    b    g   4.0   NaN
    
    print (df['colA'].value_counts())
    b    3
    c    2
    a    1
    Name: colA, dtype: int64
    
    print (df.groupby('colA').count())
          colB  colC  colD
    colA                  
    a        1     1     1
    b        3     2     2
    c        2     2     1
    
    print (df.groupby('colA')['colA'].count())
    colA
    a    1
    b    3
    c    2
    Name: colA, dtype: int64
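
To make the comparison concrete, here is a minimal sketch that rebuilds the sample frame and checks that the two approaches differ only in ordering. The variable names (by_frequency, by_key, per_column and the two same_* flags) are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same sample data as in the answer above
df = pd.DataFrame({'colA': ['c', 'c', 'b', 'a', np.nan, 'b', 'b'],
                   'colB': list('abcdefg'),
                   'colC': [1, 3, 5, 7, np.nan, np.nan, 4],
                   'colD': [np.nan, 3, 6, 9, 2, 4, np.nan]})

# value_counts: ordered by frequency, most common value first
by_frequency = df['colA'].value_counts()

# groupby + count on one column: ordered by the group key
by_key = df.groupby('colA')['colA'].count()

# Sorting the value_counts result by its index lines the two up
realigned = by_frequency.sort_index()
same_labels = list(realigned.index) == list(by_key.index)
same_counts = realigned.tolist() == by_key.tolist()

# Without a column selection, count() reports the non-NaN values
# of every remaining column for each group
per_column = df.groupby('colA').count()

print(same_labels, same_counts)
print(per_column)
```

If the goal is a quick frequency table for one column, value_counts is the shorter spelling; groupby(...).count() is more useful when the per-group, per-column non-NaN counts are what you want.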