I am looking for a Pandas function that performs the following elementary operation on a DataFrame
consisting of two columns. I would like to obtain the conditional distribution of the elements in the first column given each particular value in the second column.
Here is an example. Given:
import pandas as pd
pd.DataFrame([['a', 'b'], ['a', 'b'], ['a', 'b'], ['b', 'b'], ['b', 'b'],['a','a']])
which looks like:
   0  1
0  a  b
1  a  b
2  a  b
3  b  b
4  b  b
5  a  a
we should obtain:
     'a'  'b'
'a'    1  0.6
'b'    0  0.4
Note that each column must sum to 1, since each column is a frequency distribution.
import pandas as pd
data = pd.DataFrame([['a', 'b'], ['a', 'b'], ['a', 'b'], ['b', 'b'], ['b', 'b'],['a','a']])
# Answer: cross-tabulate the counts, then divide each column by its sum
pd.crosstab(data[0], data[1]).apply(lambda r: r / r.sum(), axis=0)
1  a    b
0
a  1  0.6
b  0  0.4
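As a possible alternative to the apply/lambda step, here is a minimal sketch that lets crosstab do the division itself. It assumes a pandas version in which pd.crosstab accepts the normalize argument; the variable name cond is only illustrative.

```python
import pandas as pd

data = pd.DataFrame([['a', 'b'], ['a', 'b'], ['a', 'b'],
                     ['b', 'b'], ['b', 'b'], ['a', 'a']])

# normalize='columns' divides each column of counts by that column's total,
# so every column becomes the distribution of column 0 given a value of column 1
# (assumes a pandas version that supports the normalize argument)
cond = pd.crosstab(data[0], data[1], normalize='columns')
print(cond)

# sanity check: each column is a distribution, so it should sum to 1
print(cond.sum(axis=0))
```

Passing normalize='index' instead would normalize along rows, which would give the distribution of the second column given each value of the first.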