I am trying to run a two-sample t-test to check whether there is a significant difference in means between two datasets.
I have two datasets; each has 5 trials, and each trial has 3 features. Every trial has its own unique label, but the 3 features (X1, X2, X3) are the same across all trials. For every trial we measure the 3 features, and the measured values are shown below. I want to compare the mean of each feature between the two datasets.
This is how my data looks when I get it from SQL:
Data Set 1:
T1 X1 0.93
T1 X2 0.3
T1 X3 -2.9
T2 X1 1.3
T2 X2 0.8
T2 X3 1.9
T3 X1 2.3
T3 X2 -1.8
T3 X3 0.9
T4 X1 0.3
T4 X2 0.8
T4 X3 0.9
T5 X1 0.3
T5 X2 0.8
T5 X3 0.9
Data Set 2:
T10 X1 1.3
T10 X2 -2.8
T10 X3 0.09
T11 X1 3.3
T11 X2 0.8
T11 X3 1.9
T12 X1 0.3
T12 X2 -4.8
T12 X3 2.9
T13 X1 1.3
T13 X2 2.8
T13 X3 0.19
T14 X1 2.3
T14 X2 0.08
T14 X3 -0.9
This is how I want my output to look: the t-test should be applied to each feature separately, so that I get one p-value per feature.
Feature Mean-DataSET1 Mean-DataSET2 P-value
X1
X2
X3
When I run stats.ttest_ind(set1['value'], set2['value']).pvalue, I get one single p-value for all the values pooled together, not one per feature.
Thanks!
I wrote the data from your question to two tab-delimited files, read them in below, and added a column to indicate which data set each row comes from:
import pandas as pd
from scipy.stats import ttest_ind
t1 = pd.read_csv("../t1.csv",names=['V1','V2','V3'],sep="\t")
t1['data'] = 'data1'
t2 = pd.read_csv("../t2.csv",names=['V1','V2','V3'],sep="\t")
t2['data'] = 'data2'
V1 V2 V3 data
0 T1 X1 0.93 data1
1 T1 X2 0.30 data1
2 T1 X3 -2.90 data1
3 T2 X1 1.30 data1
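If you would rather not create the two files, the same frames can be built from the values in the question. This is only a sketch: the names SET1, SET2 and load are made up for illustration, and the text is split on whitespace instead of tabs.

```python
import io
import pandas as pd

# Values copied from the question, one "trial feature value" triple per line.
SET1 = """\
T1 X1 0.93
T1 X2 0.3
T1 X3 -2.9
T2 X1 1.3
T2 X2 0.8
T2 X3 1.9
T3 X1 2.3
T3 X2 -1.8
T3 X3 0.9
T4 X1 0.3
T4 X2 0.8
T4 X3 0.9
T5 X1 0.3
T5 X2 0.8
T5 X3 0.9
"""

SET2 = """\
T10 X1 1.3
T10 X2 -2.8
T10 X3 0.09
T11 X1 3.3
T11 X2 0.8
T11 X3 1.9
T12 X1 0.3
T12 X2 -4.8
T12 X3 2.9
T13 X1 1.3
T13 X2 2.8
T13 X3 0.19
T14 X1 2.3
T14 X2 0.08
T14 X3 -0.9
"""


def load(text, label):
    """Parse whitespace-separated rows and tag them with their data set name."""
    frame = pd.read_csv(io.StringIO(text), sep=r"\s+", names=["V1", "V2", "V3"])
    frame["data"] = label
    return frame


t1 = load(SET1, "data1")
t2 = load(SET2, "data2")
print(t1.head(4))
```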
Then we concatenate them; calculating the means is straightforward:
df = pd.concat([t1,t2])
res = df.groupby("V2").apply(lambda x:x['V3'].groupby(x['data']).mean())
data data1 data2
V2
X1 1.026 1.700
X2 0.180 -0.784
X3 0.340 0.836
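As a possible alternative to the nested groupby, the same table of means can be obtained by grouping on both columns and unstacking. This is a sketch that rebuilds df from the values in the question, so the dictionary layout and the name means are my own choices, not something you need to keep.

```python
import pandas as pd

# Measurements from the question, organised as data set -> feature -> values.
values = {
    "data1": {"X1": [0.93, 1.3, 2.3, 0.3, 0.3],
              "X2": [0.3, 0.8, -1.8, 0.8, 0.8],
              "X3": [-2.9, 1.9, 0.9, 0.9, 0.9]},
    "data2": {"X1": [1.3, 3.3, 0.3, 1.3, 2.3],
              "X2": [-2.8, 0.8, -4.8, 2.8, 0.08],
              "X3": [0.09, 1.9, 2.9, 0.19, -0.9]},
}
rows = [(name, feature, value)
        for name, features in values.items()
        for feature, measurements in features.items()
        for value in measurements]
df = pd.DataFrame(rows, columns=["data", "V2", "V3"])

# One mean per (feature, data set) pair, then move the data set level to columns.
means = df.groupby(["V2", "data"])["V3"].mean().unstack("data")
print(means)
```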
Getting the p-value requires a bit more code inside the apply:
res['pvalue'] = df.groupby("V2").apply(lambda x:
ttest_ind(x[x['data']=="data1"]["V3"],x[x['data']=="data2"]["V3"])[1])
data data1 data2 pvalue
V2
X1 1.026 1.700 0.316575
X2 0.180 -0.784 0.521615
X3 0.340 0.836 0.657752
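If the lambda is hard to read, an explicit loop over the features does the same thing. The sketch below rebuilds df from the values in the question and uses the column names from the desired output. The extra Welch column (equal_var=False) is my own addition, in case you do not want to assume equal variances in the two data sets.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Measurements from the question, organised as data set -> feature -> values.
values = {
    "data1": {"X1": [0.93, 1.3, 2.3, 0.3, 0.3],
              "X2": [0.3, 0.8, -1.8, 0.8, 0.8],
              "X3": [-2.9, 1.9, 0.9, 0.9, 0.9]},
    "data2": {"X1": [1.3, 3.3, 0.3, 1.3, 2.3],
              "X2": [-2.8, 0.8, -4.8, 2.8, 0.08],
              "X3": [0.09, 1.9, 2.9, 0.19, -0.9]},
}
rows = [(name, feature, value)
        for name, features in values.items()
        for feature, measurements in features.items()
        for value in measurements]
df = pd.DataFrame(rows, columns=["data", "V2", "V3"])

records = []
for feature, group in df.groupby("V2"):
    a = group.loc[group["data"] == "data1", "V3"]
    b = group.loc[group["data"] == "data2", "V3"]
    student = ttest_ind(a, b)                 # scipy default: pooled variance
    welch = ttest_ind(a, b, equal_var=False)  # no equal-variance assumption
    records.append({
        "Feature": feature,
        "Mean-DataSET1": a.mean(),
        "Mean-DataSET2": b.mean(),
        "P-value": student.pvalue,
        "P-value (Welch)": welch.pvalue,
    })

summary = pd.DataFrame(records)
print(summary)
```

With only 5 trials per data set, either test has little power, so it is not surprising that none of the features comes out significant.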
You can always call res.reset_index() to turn the V2 index into a regular column and get a flat table.
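To get the column names from the question, the flattened result can also be renamed. This is a sketch only: res is recreated here by hand from the rounded numbers printed above so that the snippet runs on its own, and the name table is made up.

```python
import pandas as pd

# Stand-in for res, typed in from the printed output (rounded values).
res = pd.DataFrame(
    {"data1": [1.026, 0.180, 0.340],
     "data2": [1.700, -0.784, 0.836],
     "pvalue": [0.316575, 0.521615, 0.657752]},
    index=pd.Index(["X1", "X2", "X3"], name="V2"),
)
res.columns.name = "data"

# Move the feature index into a column and use the headers from the question.
table = res.reset_index().rename(columns={
    "V2": "Feature",
    "data1": "Mean-DataSET1",
    "data2": "Mean-DataSET2",
    "pvalue": "P-value",
})
table.columns.name = None  # drop the leftover "data" label on the column axis
print(table.to_string(index=False))
```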