Search code examples
pythonpandashistogramcategorical-data

histogram/value counts from pandas dataframe columns with categorical data and custom "bins"


Consider the following dataframe:

import pandas as pd

x = pd.DataFrame([[ 'a', 'b'], ['a', 'c'], ['c', 'b'], ['d', 'c']])
print(x)

   0  1
0  a  b
1  a  c
2  c  b
3  d  c

I would like to obtain the relative frequencies of the data in each column of the dataframe based on some custom "bins" which would be (a possible super-set of) the unique data values. For example, if:

b = ['a', 'b', 'c', 'd', 'e', 'f']

I would like to obtain:

   0  1
a  2  0
b  0  2
c  1  2
d  1  0
e  0  0
f  0  0

Is there a one (or two) liner to achieve this?


Solution

  • Try apply value_counts, then reindex based on b:

    import pandas as pd
    
    x = pd.DataFrame([['a', 'b'], ['a', 'c'], ['c', 'b'], ['d', 'c']])
    
    b = ['a', 'b', 'c', 'd', 'e', 'f']
    df = x.apply(lambda s: s.value_counts()).reindex(b).fillna(0).astype(int)
    
    print(df)
    

    df:

       0  1
    a  2  0
    b  0  2
    c  1  2
    d  1  0
    e  0  0
    f  0  0
    

    A melt and crosstab option:

    import pandas as pd
    
    x = pd.DataFrame([['a', 'b'], ['a', 'c'], ['c', 'b'], ['d', 'c']])
    
    b = ['a', 'b', 'c', 'd', 'e', 'f']
    df = x.melt()
    df = pd.crosstab(df['value'], df['variable']) \
        .reindex(b).fillna(0).astype(int) \
        .rename_axis(None, axis=1).rename_axis(None, axis=0)
    
    print(df)
    

    df:

       0  1
    a  2  0
    b  0  2
    c  1  2
    d  1  0
    e  0  0
    f  0  0