
Apply a t-test per group


I have a DataFrame like this:

import numpy as np
import pandas as pd

features_df = pd.DataFrame({
    'group': np.array([0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1]),
    'variable': ['var1'] * 8 + ['var2'] * 8,
    'value': np.array([5.582443, 7.855871, 9.843828, 8.331354, 1.593624, 2.151113, 1.403245, 3.495429,
                         5.361531, 6.739888, 4.120531, 9.931341, 1.121117, 0.730207, 0.931132, 3.001303])
})

features_df

    group variable     value
0       0     var1  5.582443
1       0     var1  7.855871
2       0     var1  9.843828
3       0     var1  8.331354
4       1     var1  1.593624
5       1     var1  2.151113
6       1     var1  1.403245
7       1     var1  3.495429
8       0     var2  5.361531
9       0     var2  6.739888
10      0     var2  4.120531
11      0     var2  9.931341
12      1     var2  1.121117
13      1     var2  0.730207
14      1     var2  0.931132
15      1     var2  3.001303

I want to calculate the t-test p-value between the two groups for each variable. I can calculate each p-value manually like this:

from scipy.stats import ttest_ind

var1_0 = features_df.query('variable == "var1" & group == 0').value.values
var1_1 = features_df.query('variable == "var1" & group == 1').value.values

var2_0 = features_df.query('variable == "var2" & group == 0').value.values
var2_1 = features_df.query('variable == "var2" & group == 1').value.values


var1_pvalue = ttest_ind(var1_0, var1_1)[1]
var1_pvalue
#0.0012163722443546759

var2_pvalue = ttest_ind(var2_0, var2_1)[1]
var2_pvalue
#0.00946879342461542

So the question is: how can I get a result DataFrame like the one shown below for all variables automatically?


  variables  ttest_pvalue
0      var1      0.001216
1      var2      0.009469

Solution

  • There are several ways to do this; the core idea is to use groupby on the variable column and run the test within each group.

    Here is one example:

    from scipy.stats import ttest_ind
    
    (features_df
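     # index by group so each group's values can be selected by label
     .set_index('group')
     .groupby('variable', as_index=False)['value']
     # g is the 'value' Series for one variable; [1] picks the p-value
     .apply(lambda g: ttest_ind(g.loc[0], g.loc[1])[1])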
    )
    

    output:

      variable     value
    0     var1  0.001216
    1     var2  0.009469
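
The groupby idea above can also be written as an explicit loop over the groups, which makes it easy to name the output columns the way the question asks (variables, ttest_pvalue). This is only a sketch under some assumptions: the helper name group_pvalue is made up for illustration, the two groups are assumed to be labelled 0 and 1 as in the sample data, and ttest_ind is called with its defaults (equal variances assumed), as in the question.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Sample data copied from the question.
features_df = pd.DataFrame({
    'group': [0, 0, 0, 0, 1, 1, 1, 1] * 2,
    'variable': ['var1'] * 8 + ['var2'] * 8,
    'value': [5.582443, 7.855871, 9.843828, 8.331354,
              1.593624, 2.151113, 1.403245, 3.495429,
              5.361531, 6.739888, 4.120531, 9.931341,
              1.121117, 0.730207, 0.931132, 3.001303],
})


def group_pvalue(sub_df):
    """Hypothetical helper: p-value of a two-sample t-test, group 0 vs group 1."""
    g0 = sub_df.loc[sub_df['group'] == 0, 'value']
    g1 = sub_df.loc[sub_df['group'] == 1, 'value']
    return ttest_ind(g0, g1).pvalue


# Iterate over (variable name, sub-frame) pairs and collect one row per variable.
result = pd.DataFrame(
    [(name, group_pvalue(sub_df)) for name, sub_df in features_df.groupby('variable')],
    columns=['variables', 'ttest_pvalue'],
)
print(result)
```

On the sample data this should give the same p-values as the manual calculation in the question. Iterating over the groupby object rather than calling apply keeps the group column available inside the helper, at the cost of a few more lines.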