python pandas for-loop scipy

How to run a t-test on multiple pandas columns


I want to write short code (just a few lines) that runs a t-test between the two Product groups on Purchase_cost, Warranty_years and service_cost at the same time.

# dataset 

import pandas as pd
from scipy.stats import ttest_ind

data = {'Product': ['laptop', 'printer', 'printer', 'printer', 'laptop',
                    'printer', 'laptop', 'laptop', 'printer', 'printer'],
        'Purchase_cost': [120.09, 150.45, 300.12, 450.11, 200.55,
                          175.89, 124.12, 113.12, 143.33, 375.65],
        'Warranty_years': [3, 2, 2, 1, 4, 1, 2, 3, 1, 2],
        'service_cost': [5, 5, 10, 4, 7, 10, 4, 6, 12, 3]}

df = pd.DataFrame(data)

print(df)

My attempt for Product and Purchase_cost is below. I want to run the same t-test for Product and Warranty_years, and for Product and service_cost.


#define samples
group1 = df[df['Product']=='laptop']
group2 = df[df['Product']=='printer']

#perform independent two sample t-test
ttest_ind(group1['Purchase_cost'], group2['Purchase_cost'])


Solution

  • ttest_ind accepts 2D (N-dimensional) inputs and, with the default axis=0, runs one test per column:

    cols = df.columns.difference(['Product'])
    # or with an explicit list
    # cols = ['Purchase_cost', 'Warranty_years', 'service_cost']
    
    group1 = df[df['Product']=='laptop']
    group2 = df[df['Product']=='printer']
    out = pd.DataFrame(ttest_ind(group1[cols], group2[cols]),
                       columns=cols, index=['statistic', 'pvalue'])
    

    If it didn't, you could use a dictionary comprehension that loops over the columns:

    out = pd.DataFrame({c: ttest_ind(group1[c], group2[c]) for c in cols},
                        index=['statistic', 'pvalue'])
    

    Output:

               Purchase_cost  Warranty_years  service_cost
    statistic      -1.861113        3.513240     -0.919464
    pvalue          0.099760        0.007924      0.384738
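
As a quick sanity check, the sketch below rebuilds the question's data and confirms that the 2D call and the per-column loop give the same numbers. The names `out_2d` and `out_loop` are illustrative only, and the inputs are converted with `.to_numpy()` as a precaution so the example does not depend on how a given SciPy version handles DataFrame inputs.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

data = {'Product': ['laptop', 'printer', 'printer', 'printer', 'laptop',
                    'printer', 'laptop', 'laptop', 'printer', 'printer'],
        'Purchase_cost': [120.09, 150.45, 300.12, 450.11, 200.55,
                          175.89, 124.12, 113.12, 143.33, 375.65],
        'Warranty_years': [3, 2, 2, 1, 4, 1, 2, 3, 1, 2],
        'service_cost': [5, 5, 10, 4, 7, 10, 4, 6, 12, 3]}
df = pd.DataFrame(data)

cols = ['Purchase_cost', 'Warranty_years', 'service_cost']
group1 = df[df['Product'] == 'laptop']
group2 = df[df['Product'] == 'printer']

# 2D call: with the default axis=0, each column is tested independently
res = ttest_ind(group1[cols].to_numpy(), group2[cols].to_numpy())
out_2d = pd.DataFrame([res.statistic, res.pvalue],
                      columns=cols, index=['statistic', 'pvalue'])

# Per-column loop: one 1D test per column
results = {c: ttest_ind(group1[c].to_numpy(), group2[c].to_numpy())
           for c in cols}
out_loop = pd.DataFrame({c: [r.statistic, r.pvalue]
                         for c, r in results.items()},
                        index=['statistic', 'pvalue'])

# Both approaches should agree up to floating-point noise
print(np.allclose(out_2d.to_numpy(), out_loop.to_numpy()))
print(out_2d)
```

Both calls use the default `equal_var=True` (Student's t-test with pooled variance), which is what the output shown here corresponds to.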
    

    Generalization to more pairs

    If you have more than just laptop/printer as products and want to compare all pairs, you could generalize with:

    from itertools import combinations
    
    cols = df.columns.difference(['Product'])
    
    g = df.groupby('Product')[cols]
    
    out = pd.concat({(a,b): pd.DataFrame(ttest_ind(g.get_group(a), g.get_group(b)),
                                         columns=cols, index=['statistic', 'pvalue'])
                     for a, b in combinations(df['Product'].unique(), 2)
                    }, names=['product1', 'product2'])
    

    Example output with an extra category (phone):

                                 Purchase_cost  Warranty_years  service_cost
    product1 product2                                                       
    laptop   printer  statistic      -1.861113        3.513240     -0.919464
                      pvalue          0.099760        0.007924      0.384738
             phone    statistic      -1.945836        2.988072      2.766417
                      pvalue          0.109251        0.030515      0.039533
    printer  phone    statistic      -1.286968        0.423659      1.893370
                      pvalue          0.239026        0.684528      0.100178
    

    If you have many combinations, you should likely adjust the p-values to account for multiple testing.
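
One simple way to do this is a Bonferroni correction, sketched below with plain pandas. The p-values are copied from the example output above. Treating all nine tests as a single family is an assumption (you might instead correct within each column), and the names `pvals`, `p_bonf` and `significant` are illustrative only.

```python
import pandas as pd

# p-values copied from the example output (one row per product pair)
pvals = pd.DataFrame(
    {'Purchase_cost': [0.099760, 0.109251, 0.239026],
     'Warranty_years': [0.007924, 0.030515, 0.684528],
     'service_cost': [0.384738, 0.039533, 0.100178]},
    index=pd.MultiIndex.from_tuples(
        [('laptop', 'printer'), ('laptop', 'phone'), ('printer', 'phone')],
        names=['product1', 'product2']),
)

# Bonferroni: multiply each p-value by the number of tests, cap at 1
m = pvals.size  # assumption: all 3 pairs x 3 columns form one family
p_bonf = (pvals * m).clip(upper=1.0)

# Flag the tests that remain significant at the 5% level
significant = p_bonf < 0.05

print(p_bonf)
print(significant)
```

Bonferroni is conservative. Less strict procedures such as Holm or Benjamini-Hochberg are available, for example through `statsmodels.stats.multitest.multipletests`, if adding statsmodels as a dependency is acceptable.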