I have a DataFrame with a variable number of columns. I need to calculate the proportion of each unique value of a chosen column for each unique combination of the other. For example:
import numpy as np
import pandas as pd
a="A";b="B"
df = pd.DataFrame({
"n": [a,a,a,b,a,b,b,b],
"X": [0,0,0,0,1,1,1,1],
"Y": [0,0,1,1,0,0,0,0],
})
print(df)
n X Y
0 A 0 0
1 A 0 0
2 A 0 1
3 B 0 1
4 A 1 0
5 B 1 0
6 B 1 0
7 B 1 0
Say we want to calculate the proportion of unique n
(absolute frequency n_ru
) for each unique combination of X
and Y
(absolute frequency n_u
).
Here for example, we got 3 n=B
for the combination of 4 (X=1,Y=0)
, thus the proportion is 3/4
, etc.
I thought to do it like
# complete column list
col = list(df.columns.values)
# column list except n
cov = list(df.columns[1:].values)
# merge absolute frequencies
count = pd.merge(
# absolute freq of each (X,Y)
df.groupby(cov).count(),
# absolute freq of n for each (X,Y)
df.groupby(col).aggregate("n").count(),
# options
on=cov, suffixes=["_u", "_ru"]
)
print(count)
# calculate ell metric
ell = np.sum(
np.log(count["n_ru"]/count["n_u"])
)
print(f"ell = {ell:.3f}")
n_ru n_u
X Y
0 0 2 2
1 1 2
1 1 2
1 0 1 4
0 3 4
ell = -3.060
Is there a better way to do it?
Use DataFrame.value_counts
by all columns with divide Series by Series.div
with count columns specified in list cov
, so merge
is not necessary here:
ell = np.sum(np.log(df.value_counts().div(df[cov].value_counts())))
print(f"ell = {ell:.3f}")
ell = -3.060
If need first count
output:
count = (df.value_counts()
.to_frame('n_ru')
.droplevel(0)
.assign(n_u=lambda x: df[cov].value_counts()))
print (count)
n_ru n_u
X Y
1 0 3 4
0 0 2 2
1 1 2
1 0 1 4
0 1 1 2
Another idea is aggregate sum
:
count = (df.value_counts()
.to_frame('n_ru')
.droplevel(0)
.assign(n_u=lambda x: x.groupby(cov)['n_ru'].sum()))
print (count)
n_ru n_u
X Y
1 0 3 4
0 0 2 2
1 1 2
1 0 1 4
0 1 1 2