Tags: python, pandas, dataframe, statistics, summary

Conditional statistic summary dataframe in python


I'm trying to build a table of statistics (mean, variance, standard deviation, among others) for A and B, conditional on Y=1 and Y=0. For example:

Given this dataframe:

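import numpy as np
import pandas as pd

# np.nan is used instead of np.NaN, which was removed in NumPy 2.0
df = pd.DataFrame({'A': [0,    0.91, np.nan, 0.75,   np.nan, 1],
                   'B': [0.43, 1,    0.34,   np.nan, 0,      0.64],
                   'Y': [1,    0,    1,      1,      0,      1]})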

I'm computing the statistics with:

for i in df:
    print(i)
    print("Mean Y1 " + " " + str(df[i][df["Y"]==1].mean()))
    print("Mean Y0 " + " " + str(df[i][df["Y"]==0].mean()))
    print("Var Y1 " + " " + str(np.var(df[i][df["Y"]==1])))
    print("Var Y0 " + " " + str(np.var(df[i][df["Y"]==0])))

However, the printed output is hard to compare, so I'm trying to create a table with the statistics like this:

stats = pd.DataFrame({'Column names': ['A', 'B', 'Y'],
                   'Mean Y1': [A_mean_given_Y==1, B_mean_given_Y==1, Z], 
                   'Mean Y0': [A_mean_given_Y==0, B_mean_given_Y==0, Z],
                   'Var Y1': [A_var_given_Y==1,   B_var_given_Y==1,  Z],
                   'Var Y0': [A_var_given_Y==0,   B_var_given_Y==0,  Z] 
                  })

# NOTE: Z is any number, as its value doesn't matter.

However, a DataFrame doesn't support .append the way a list does, and converting a list of lists into a DataFrame after computing the statistics seems very inefficient. So, any idea how I can create the stats DataFrame with a loop?


Solution

  • In the end I did it this way because of its flexibility (you are not constrained by the agg function, for example; you can put any function in the table just by adding it to the loop):

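    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': [0,    0.91, np.nan, 0.75,   np.nan, 1],
                       'B': [0.43, 1,    0.34,   np.nan, 0,      0.64],
                       'Y': [1,    0,    1,      1,      0,      1]})

    # Build one row per column: name, conditional means, conditional variances.
    stats = []
    for i in df:
        new_row = [
            i,
            df[i][df["Y"]==1].mean(),      # pandas mean skips NaN by default
            df[i][df["Y"]==0].mean(),
            np.nanvar(df[i][df["Y"]==1]),  # population variance (ddof=0), NaN ignored
            np.nanvar(df[i][df["Y"]==0]),
        ]
        stats.append(new_row)

    # Convert the list of rows to a DataFrame once, after the loop.
    col_stats = ['Variable', 'Mean Y=1', 'Mean Y=0', 'Var Y=1', 'Var Y=0']
    stats = pd.DataFrame(stats, columns=col_stats)
    stats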
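
The "any function" flexibility mentioned above can be sketched by passing the statistics in as a dictionary, so that adding a statistic means adding one entry. This is only an illustration: the helper name `conditional_stats`, its parameters, and the choice of `Std` and `Median` as extra statistics are assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0,    0.91, np.nan, 0.75,   np.nan, 1],
                   'B': [0.43, 1,    0.34,   np.nan, 0,      0.64],
                   'Y': [1,    0,    1,      1,      0,      1]})


def conditional_stats(frame, target, funcs):
    """Hypothetical helper: one row per column of `frame`, one output
    column per (statistic, target value) pair."""
    # Target values in descending order, so Y=1 comes before Y=0.
    values = sorted(frame[target].unique(), reverse=True)
    rows = []
    for col in frame.columns:
        row = {'Variable': col}
        for name, func in funcs.items():
            for value in values:
                # Select the rows where target == value, as a float array.
                subset = frame.loc[frame[target] == value, col].to_numpy(dtype=float)
                row[f'{name} {target}={value}'] = func(subset)
        rows.append(row)
    # Build the DataFrame once, from the collected list of dicts.
    return pd.DataFrame(rows)


# NaN-aware NumPy functions; nanvar and nanstd use ddof=0 by default.
funcs = {
    'Mean': np.nanmean,
    'Var': np.nanvar,
    'Std': np.nanstd,
    'Median': np.nanmedian,
}

stats = conditional_stats(df, 'Y', funcs)
print(stats)
```

If only built-in aggregations are needed, `df.groupby('Y').agg(['mean', 'var'])` gives a similar table in one line, but note that pandas' `var` defaults to the sample variance (ddof=1), so its values will differ from `np.nanvar`.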