Search code examples
pythonpandasdataframedata-analysis

how to calculate value counts when we have more than one value in a colum in pandas dataframe


df,

Name
Sri
Sri,Ram
Sri,Ram,kumar
Ram

I am trying to calculate the value counts for each value. I am not getting my output when using

 df["Name"].values_count()

my desired output is,

 Sri     3
 Ram     3
 Kumar   1

Solution

  • split the column, stack to long format, then count:

    df.Name.str.split(',', expand=True).stack().value_counts()
    
    #Sri      3
    #Ram      3
    #kumar    1
    #dtype: int64
    

    Or maybe:

    df.Name.str.get_dummies(',').sum()
    
    #Ram      3
    #Sri      3
    #kumar    1
    #dtype: int64
    

    Or concatenate before value_counts:

    pd.value_counts(pd.np.concatenate(df.Name.str.split(',')))
    
    #Sri      3
    #Ram      3
    #kumar    1
    #dtype: int64
    

    Timing:

    %timeit df.Name.str.split(',', expand=True).stack().value_counts()
    #1000 loops, best of 3: 1.02 ms per loop
    
    %timeit df.Name.str.get_dummies(',').sum()
    #1000 loops, best of 3: 1.18 ms per loop
    
    %timeit pd.value_counts(pd.np.concatenate(df.Name.str.split(',')))
    #1000 loops, best of 3: 573 µs per loop
    
    # option from @Bharathshetty 
    from collections import Counter
    %timeit pd.Series(Counter((df['Name'].str.strip() + ',').sum().rstrip(',').split(',')))
    # 1000 loops, best of 3: 498 µs per loop
    
    # option inspired by @Bharathshetty 
    %timeit pd.value_counts(df.Name.str.cat(sep=',').split(','))
    # 1000 loops, best of 3: 483 µs per loop