
Proportion of each unique value of a chosen column for each unique combination of the other in a DataFrame


I have a DataFrame with a variable number of columns. I need to calculate the proportion of each unique value of a chosen column within each unique combination of the other columns. For example:

import numpy as np
import pandas as pd

a="A";b="B"
df = pd.DataFrame({
    "n": [a,a,a,b,a,b,b,b],
    "X": [0,0,0,0,1,1,1,1],
    "Y": [0,0,1,1,0,0,0,0],
})
print(df)
   n  X  Y
0  A  0  0
1  A  0  0
2  A  0  1
3  B  0  1
4  A  1  0
5  B  1  0
6  B  1  0
7  B  1  0

Say we want to calculate the proportion of each unique n (with absolute frequency n_ru) within each unique combination of X and Y (with absolute frequency n_u). Here, for example, n=B occurs 3 times among the 4 rows with the combination (X=1, Y=0), so the proportion is 3/4, and so on.
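To make the target concrete, the per-row proportions behind this definition can be sketched with two groupby/transform calls (just an illustration of the quantity, not the aggregated table I'm after):

```python
import pandas as pd

df = pd.DataFrame({
    "n": ["A", "A", "A", "B", "A", "B", "B", "B"],
    "X": [0, 0, 0, 0, 1, 1, 1, 1],
    "Y": [0, 0, 1, 1, 0, 0, 0, 0],
})

# n_ru: size of the group sharing the same (n, X, Y)
n_ru = df.groupby(["n", "X", "Y"])["n"].transform("size")
# n_u: size of the group sharing the same (X, Y)
n_u = df.groupby(["X", "Y"])["n"].transform("size")

prop = n_ru / n_u  # per-row proportion, e.g. rows with (n=B, X=1, Y=0) get 3/4
```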

I thought to do it like

# complete column list
col = list(df.columns.values)
# column list except n
cov = list(df.columns[1:].values)

# merge absolute frequencies
count = pd.merge(
    # absolute freq of each (X,Y)
    df.groupby(cov).count(),
    # absolute freq of n for each (X,Y)
    df.groupby(col).size().droplevel("n").rename("n"),
    # options
    on=cov, suffixes=["_u", "_ru"]
)
print(count)
# calculate ell metric
ell = np.sum(
    np.log(count["n_ru"]/count["n_u"])
)
print(f"ell = {ell:.3f}")
     n_u  n_ru
X Y
0 0    2     2
  1    2     1
  1    2     1
1 0    4     1
  0    4     3

ell = -3.060

Is there a better way to do it?


Solution

  • Use DataFrame.value_counts on all columns, then divide with Series.div by the counts of the columns listed in cov; the two results align on the shared X and Y index levels, so the merge is not necessary here:

    ell = np.sum(np.log(df.value_counts().div(df[cov].value_counts())))
    
    print(f"ell = {ell:.3f}")
    ell = -3.060
    

    If you need the intermediate count output first:

    count = (df.value_counts()
               .to_frame('n_ru')
               .droplevel(0)
               .assign(n_u=lambda x: df[cov].value_counts()))
    print(count)
         n_ru  n_u
    X Y           
    1 0     3    4
    0 0     2    2
      1     1    2
    1 0     1    4
    0 1     1    2
    

    Another idea is to compute n_u as a grouped sum of n_ru:

    count = (df.value_counts()
               .to_frame('n_ru')
               .droplevel(0)
               .assign(n_u=lambda x: x.groupby(cov)['n_ru'].sum()))
    print(count)
         n_ru  n_u
    X Y           
    1 0     3    4
    0 0     2    2
      1     1    2
    1 0     1    4
    0 1     1    2
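
    With either count variant, the ell metric from the question can be recovered directly from the two columns. A usage sketch on the same toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "n": ["A", "A", "A", "B", "A", "B", "B", "B"],
    "X": [0, 0, 0, 0, 1, 1, 1, 1],
    "Y": [0, 0, 1, 1, 0, 0, 0, 0],
})
cov = ["X", "Y"]

# grouped-sum variant: n_u is the sum of n_ru within each (X, Y)
count = (df.value_counts()
           .to_frame("n_ru")
           .droplevel(0)
           .assign(n_u=lambda x: x.groupby(cov)["n_ru"].sum()))

ell = np.log(count["n_ru"] / count["n_u"]).sum()
print(f"ell = {ell:.3f}")
```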