
Loop T-Test for Comparison of Multiple Data Columns


I have a pandas DataFrame with 11 columns of data. I want to compare each column to every other column with a t-test (see below). How can I create a loop that automatically compares all columns without manually writing the code for each column-pair combination?

from scipy.stats import ttest_ind
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = ttest_ind(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')

Is it possible to show the result in a matrix or graphically? Thank you in advance!


Solution

  • Let's use a nested dict comprehension to calculate the t-test for every possible combination of columns, then initialise a new DataFrame from the nested dict to create a nicely formatted matrix representation:

    from scipy.stats import ttest_ind
    import pandas as pd

    # t-statistic and p-value for every ordered pair of columns
    dct = {x: {y: 's={:.2f}, p={:.2f}'.format(
              *ttest_ind(df[x], df[y])) for y in df} for x in df}
    mat = pd.DataFrame(dct)
    
    
    print(mat)
                     data1           data2
    data1   s=0.00, p=1.00  s=0.33, p=0.75
    data2  s=-0.33, p=0.75  s=0.00, p=1.00
    

    If you need a matrix containing only the p-values:

    dct = {x: {y: ttest_ind(df[x], df[y]).pvalue for y in df} for x in df}
    mat = pd.DataFrame(dct)
    
    print(mat)
             data1    data2
    data1  1.00000  0.74847
    data2  0.74847  1.00000
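
    The same comprehension scales to all 11 columns with no extra code. Below is a minimal self-contained sketch; the random data and the `data1` ... `data11` column names are stand-ins for your own DataFrame:

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Stand-in for your DataFrame: 11 columns of random data
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10, 11)),
                  columns=[f'data{i}' for i in range(1, 12)])

# Pairwise p-value matrix over every column combination
mat = pd.DataFrame({x: {y: ttest_ind(df[x], df[y]).pvalue for y in df}
                    for x in df})

print(mat.shape)  # -> (11, 11)
```

    The matrix is symmetric, and the diagonal is 1.0 (each column compared with itself), so only the upper triangle carries new information.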
    

    To calculate the mean of all p-values use:

    mat.to_numpy().mean()
    0.8742349436807844
    

    Note: df is the DataFrame containing the columns data1, data2, ...
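
    To show the result graphically, one option is to render the p-value matrix as a heatmap. A sketch with matplotlib (the two-column DataFrame below just reuses the sample data from the question; any numeric DataFrame works, and the `ttest_matrix.png` filename is arbitrary):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line for on-screen display
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import ttest_ind

df = pd.DataFrame({
    'data1': [0.873, 2.817, 0.121, -0.945, -0.055,
              -1.436, 0.360, -1.478, -1.637, -1.869],
    'data2': [1.142, -0.432, -0.938, -0.729, -0.846,
              -0.157, 0.500, 1.183, -1.075, -0.169],
})

# p-value matrix, as above
mat = pd.DataFrame({x: {y: ttest_ind(df[x], df[y]).pvalue for y in df}
                    for x in df})

fig, ax = plt.subplots()
im = ax.imshow(mat, cmap='viridis', vmin=0, vmax=1)
ax.set_xticks(range(len(mat.columns)))
ax.set_xticklabels(mat.columns)
ax.set_yticks(range(len(mat.index)))
ax.set_yticklabels(mat.index)
fig.colorbar(im, ax=ax, label='p-value')
fig.savefig('ttest_matrix.png')
```

    Fixing `vmin=0, vmax=1` keeps the colour scale comparable across runs, since p-values always fall in that range.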