python, python-3.x, pandas, dataframe, frequency-distribution

Summarizing frequencies across two columns with Pandas


I am looking for a Pandas function that performs the following elementary operation on a DataFrame consisting of two columns: I would like to obtain the conditional distribution of the elements in the first column given each particular value in the second column.

Here is an example. Given:

import pandas as pd
pd.DataFrame([['a', 'b'], ['a', 'b'], ['a', 'b'], ['b', 'b'], ['b', 'b'],['a','a']])

which looks like:

   0  1
0  a  b
1  a  b
2  a  b
3  b  b
4  b  b
5  a  a

we should obtain:

    'a' 'b'
'a'  1   0.6
'b'  0   0.4

Note that the columns must sum up to 1 as these are frequency distributions.


Solution

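    import pandas as pd

    data = pd.DataFrame([['a', 'b'], ['a', 'b'], ['a', 'b'],
                         ['b', 'b'], ['b', 'b'], ['a', 'a']])

    # Count co-occurrences (rows: values of column 0, columns: values of column 1),
    # then divide each column by its total so that every column sums to 1.
    pd.crosstab(data[0], data[1]).apply(lambda col: col / col.sum(), axis=0)

    1    a    b
    0
    a  1.0  0.6
    b  0.0  0.4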
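
As a possible alternative, pd.crosstab can do the normalization itself through its normalize parameter, which avoids the explicit apply. A minimal sketch, assuming a pandas version that supports normalize (the names cond and column_sums are only illustrative):

```python
import pandas as pd

data = pd.DataFrame([['a', 'b'], ['a', 'b'], ['a', 'b'],
                     ['b', 'b'], ['b', 'b'], ['a', 'a']])

# normalize='columns' divides each column by its own total, so every
# column becomes a distribution over the values of column 0.
cond = pd.crosstab(data[0], data[1], normalize='columns')

# Sanity check: each column should sum to 1.
column_sums = cond.sum(axis=0)
print(cond)
print(column_sums)
```

Passing normalize='index' instead would condition the other way around, giving the distribution of the second column for each value of the first.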