I have a dataframe like this:
import numpy as np
import pandas as pd
features_df = pd.DataFrame({
'group': np.array([0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1]),
'variable': ['var1'] * 8 + ['var2'] * 8,
'value': np.array([5.582443, 7.855871, 9.843828, 8.331354, 1.593624, 2.151113, 1.403245, 3.495429,
5.361531, 6.739888, 4.120531, 9.931341, 1.121117, 0.730207, 0.931132, 3.001303])
})
features_df
group variable value
0 0 var1 5.582443
1 0 var1 7.855871
2 0 var1 9.843828
3 0 var1 8.331354
4 1 var1 1.593624
5 1 var1 2.151113
6 1 var1 1.403245
7 1 var1 3.495429
8 0 var2 5.361531
9 0 var2 6.739888
10 0 var2 4.120531
11 0 var2 9.931341
12 1 var2 1.121117
13 1 var2 0.730207
14 1 var2 0.931132
15 1 var2 3.001303
I want to calculate the t-test p-value for each variable, comparing group 0 against group 1. I can calculate each p-value manually like this:
from scipy.stats import ttest_ind
var1_0 = features_df.query('variable == "var1" & group == 0').value.values
var1_1 = features_df.query('variable == "var1" & group == 1').value.values
var2_0 = features_df.query('variable == "var2" & group == 0').value.values
var2_1 = features_df.query('variable == "var2" & group == 1').value.values
var1_pvalue = ttest_ind(var1_0, var1_1)[1]
var1_pvalue
#0.0012163722443546759
var2_pvalue = ttest_ind(var2_0, var2_1)[1]
var2_pvalue
#0.00946879342461542
So the question is: how can I get a result dataframe like the one below for all variables automatically?
variables ttest_pvalue
0 var1 0.001216
1 var2 0.009469
There are several ways to do this; the core idea is to use groupby on the variable column and run the t-test within each group.
Here is one example:
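from scipy.stats import ttest_ind
(features_df
.set_index('group')  # index by group so each group's values can be selected by label
.groupby('variable', as_index=False)['value']
# compare group 0 against group 1 within each variable and keep only the p-value
.apply(lambda g: ttest_ind(g.loc[0], g.loc[1]).pvalue)
)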
output:
variable value
0 var1 0.001216
1 var2 0.009469
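If you want the exact column names from the question (variables, ttest_pvalue), one alternative is to iterate over the groups and build the result frame yourself. This is only a minimal sketch: the helper name group_pvalue is made up for illustration, and it assumes the groups are always labelled 0 and 1 as in the sample data.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Same sample data as in the question
features_df = pd.DataFrame({
    'group': np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]),
    'variable': ['var1'] * 8 + ['var2'] * 8,
    'value': np.array([5.582443, 7.855871, 9.843828, 8.331354,
                       1.593624, 2.151113, 1.403245, 3.495429,
                       5.361531, 6.739888, 4.120531, 9.931341,
                       1.121117, 0.730207, 0.931132, 3.001303]),
})

def group_pvalue(sub):
    # Hypothetical helper: split one variable's rows by group label
    # (assumed to be 0 and 1) and return the two-sample t-test p-value.
    a = sub.loc[sub['group'] == 0, 'value']
    b = sub.loc[sub['group'] == 1, 'value']
    return ttest_ind(a, b).pvalue

# One (variable name, p-value) pair per variable
rows = [(name, group_pvalue(sub)) for name, sub in features_df.groupby('variable')]
result = pd.DataFrame(rows, columns=['variables', 'ttest_pvalue'])
print(result)
```

This should reproduce the p-values computed manually in the question. Note that ttest_ind assumes equal variances by default; pass equal_var=False if you would rather use Welch's t-test.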