Search code examples
pythondataframefrequency

How to count frequency of select values in Python pandas dataframe


I have a dataframe with two columns, one for names and one for string values. I'm trying to count frequency of select string values by names.

I've tried pandas.pivot_table and pandas.DataFrame.groupby but I'd like to create a whole new dataframe rather than aggregation.

For example, I have a dataframe:

import pandas as pd
import numpy as np

data = np.array([['John', 'x'], ['John', 'x'], ['John', 'x'], ['John', 'y'], ['John', 'y'], ['John', 'a'], 
                 ['Will', 'x'], ['Will', 'z']])

df = pd.DataFrame(data, columns=['name','str_value'])
df

which results in:

   name      str_value
0  John              x
1  John              x
2  John              x
3  John              y
4  John              y
5  John              a
6  Will              x
7  Will              z

An expected result would be:

   name        x        y        z
0  John        3        2        0 
1  Will        1        0        1  

and additionally:

   name        x        y        z
0  John     True     True    False 
1  Will     True    False     True   

I'd like to select x, y, z only and return True or False based on whether the returned value is 0 or NaN.

Edit: Thank you for the answers. These work great, but the output has the subgroup "str_value":

str_value     x      y      z
name
John       True   True  False
Will       True  False   True

Is there a way to remove this so I have "name", "x", "y", "z" on the same level? With .reset_index() I get:

str_value  name     x      y      z
0          John  True   True  False
1          Will  True  False   True

Is the name of my index "str_value" now? Can I rename or drop this?


Solution

  • You can try:

    df.groupby(["name", "str_value"]).size().unstack()[['x', 'y', 'z']].gt(0)
    

    Explanations:

    1. Count the occurence per name and str_value with groupby and size:
    print(df.groupby(["name", "str_value"]).size())
    # John  a            1
    #       x            3
    #       y            2
    # Will  x            1
    #       z            1
    # dtype: int64
    
    1. Unstack with unstack
    print(df.groupby(["name", "str_value"]).size().unstack())
    # str_value    a    x    y    z
    # name
    # John       1.0  3.0  2.0  NaN
    # Will       NaN  1.0  NaN  1.0
    
    1. Select desired columns:
    print(df.groupby(["name", "str_value"]).size().unstack()[['x', 'y', 'z']])
    # str_value    x    y    z
    # name
    # John       3.0  2.0  NaN
    # Will       1.0  NaN  1.0
    
    1. Compare values greater than 0 with gt:
    result = df.groupby(["name", "str_value"]).size().unstack()[['x', 'y', 'z']].gt(0)
    print(result)
    # str_value     x      y      z
    # name
    # John       True   True  False
    # Will       True  False   True