Two sample t-test for every individual row in Python


I am trying to do a two-sample t-test to check whether there is a significant difference in mean between two datasets.

I have two datasets. Each dataset has 5 trials, and each trial has 3 features. Every trial has its own unique label, but the 3 features (X1, X2, X3) are the same across all trials. For every trial we measure the 3 features, and the measured values are shown below. I want to compare the mean of each feature between the two datasets.

This is how my data looks when I get it from SQL.

Data Set 1:

T1  X1   0.93
T1  X2   0.3
T1  X3   -2.9
T2  X1   1.3
T2  X2   0.8
T2  X3   1.9
T3  X1   2.3
T3  X2   -1.8
T3  X3   0.9
T4  X1   0.3
T4  X2   0.8
T4  X3   0.9
T5  X1   0.3
T5  X2   0.8
T5  X3   0.9

Data Set 2:

T10 X1  1.3
T10 X2  -2.8
T10 X3  0.09
T11 X1  3.3
T11 X2  0.8
T11 X3  1.9
T12 X1  0.3
T12 X2  -4.8
T12 X3  2.9
T13 X1  1.3
T13 X2  2.8
T13 X3  0.19
T14 X1  2.3
T14 X2  0.08
T14 X3  -0.9

This is how I want my output to look: the t-test should be applied to each feature separately, so that I get one p-value per feature.

Feature  Mean-DataSET1  Mean-DataSET2  P-value 
X1
X2
X3  

When I run stats.ttest_ind(set1['value'], set2['value']).pvalue, I get only one single p-value for all the data instead of one per feature.

Thanks!


Solution

  • I wrote your data above to two tab-delimited files, read them in below, and added a column to indicate which dataframe (table) each row comes from:

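    import pandas as pd
    from scipy.stats import ttest_ind
    # V1 = trial label, V2 = feature name, V3 = measured value
    t1 = pd.read_csv("../t1.csv",names=['V1','V2','V3'],sep="\t")
    t1['data'] = 'data1'   # tag rows from Data Set 1
    t2 = pd.read_csv("../t2.csv",names=['V1','V2','V3'],sep="\t")
    t2['data'] = 'data2'   # tag rows from Data Set 2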
    
        V1  V2     V3   data
    0   T1  X1   0.93  data1
    1   T1  X2   0.30  data1
    2   T1  X3  -2.90  data1
    3   T2  X1   1.30  data1
    

    Then we concatenate them, after which calculating the mean per feature and dataset is straightforward:

    df = pd.concat([t1,t2])
    res = df.groupby("V2").apply(lambda x:x['V3'].groupby(x['data']).mean())
    data  data1   data2
    V2
    X1    1.026   1.700
    X2    0.180  -0.784
    X3    0.340   0.836
    

    Getting the p-value requires a bit more code inside the apply, because each feature's rows have to be split by dataset before calling ttest_ind:

    res['pvalue'] = df.groupby("V2").apply(
        lambda x: ttest_ind(x[x['data']=="data1"]["V3"],
                            x[x['data']=="data2"]["V3"])[1])
    data  data1   data2    pvalue
    V2
    X1    1.026   1.700  0.316575
    X2    0.180  -0.784  0.521615
    X3    0.340   0.836  0.657752
    

    If you prefer the feature as a regular column rather than the index, call res.reset_index() to get a flat table.
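
For readers who want to try the steps above without creating the two files, here is a minimal self-contained sketch. It assumes the values from the question are typed in directly instead of read from t1.csv and t2.csv. The helper names to_long and summarize are made up for this illustration. The column names V2, V3 and data follow the answer, and the output columns follow the layout requested in the question.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Values copied from the question: five trials per feature in each data set.
set1 = {"X1": [0.93, 1.3, 2.3, 0.3, 0.3],
        "X2": [0.3, 0.8, -1.8, 0.8, 0.8],
        "X3": [-2.9, 1.9, 0.9, 0.9, 0.9]}
set2 = {"X1": [1.3, 3.3, 0.3, 1.3, 2.3],
        "X2": [-2.8, 0.8, -4.8, 2.8, 0.08],
        "X3": [0.09, 1.9, 2.9, 0.19, -0.9]}


def to_long(values_by_feature, label):
    """Hypothetical helper: build the long (feature, value, data set) layout."""
    rows = [(feature, value, label)
            for feature, values in values_by_feature.items()
            for value in values]
    return pd.DataFrame(rows, columns=["V2", "V3", "data"])


df = pd.concat([to_long(set1, "data1"), to_long(set2, "data2")],
               ignore_index=True)


def summarize(feature, group):
    """Hypothetical helper: means and t-test p-value for one feature."""
    a = group.loc[group["data"] == "data1", "V3"]
    b = group.loc[group["data"] == "data2", "V3"]
    return {"Feature": feature,
            "Mean-DataSET1": a.mean(),
            "Mean-DataSET2": b.mean(),
            "P-value": ttest_ind(a, b).pvalue}


# One row per feature, because the test runs once inside each V2 group.
summary = pd.DataFrame([summarize(f, g) for f, g in df.groupby("V2")])
print(summary)
```

Note that ttest_ind assumes equal variances in the two samples by default. If that assumption is doubtful for your measurements, passing equal_var=False runs Welch's t-test instead.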