How to run a t-test on multiple pandas columns

I want to write a few lines of code that run a t-test for Product against Purchase_cost, Warranty_years and service_cost at the same time.

# dataset 

import pandas as pd
from scipy.stats import ttest_ind

data = {'Product': ['laptop', 'printer','printer','printer','laptop','printer','laptop','laptop','printer','printer'],
        'Purchase_cost': [120.09, 150.45, 300.12, 450.11, 200.55,175.89,124.12,113.12,143.33,375.65],
        'Warranty_years':[3,2,2,1,4,1,2,3,1,2],
        'service_cost': [5,5,10,4,7,10,4,6,12,3]
    
        }

df = pd.DataFrame(data)

print(df)

My attempt for Product and Purchase_cost is below. I also want to run the t-test for Product and Warranty_years, and for Product and service_cost.


#define samples
group1 = df[df['Product']=='laptop']
group2 = df[df['Product']=='printer']

#perform independent two sample t-test
ttest_ind(group1['Purchase_cost'], group2['Purchase_cost'])

Solution

ttest_ind accepts 2D (N-dimensional) inputs and, with the default axis=0, runs the test column by column:

cols = df.columns.difference(['Product'])
# or with an explicit list
# cols = ['Purchase_cost', 'Warranty_years', 'service_cost']

group1 = df[df['Product']=='laptop']
group2 = df[df['Product']=='printer']
out = pd.DataFrame(ttest_ind(group1[cols], group2[cols]),
                   columns=cols, index=['statistic', 'pvalue'])

If it couldn't, you could use a dictionary comprehension that loops over your columns instead:

out = pd.DataFrame({c: ttest_ind(group1[c], group2[c]) for c in cols},
                    index=['statistic', 'pvalue'])

Output:

           Purchase_cost  Warranty_years  service_cost
statistic      -1.861113        3.513240     -0.919464
pvalue          0.099760        0.007924      0.384738
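
As a sanity check, the Purchase_cost statistic can be reproduced by hand. This is a minimal sketch using only the standard library, and it assumes ttest_ind was called with its default equal_var=True, i.e. a pooled-variance Student's t-test. The two lists are the Purchase_cost values from the question's dataset, split by Product; the variable names are mine.

```python
from math import sqrt
from statistics import mean, variance

# Purchase_cost values from the question's dataset, split by Product
laptop = [120.09, 200.55, 124.12, 113.12]
printer = [150.45, 300.12, 450.11, 175.89, 143.33, 375.65]

n1, n2 = len(laptop), len(printer)

# Pooled variance: the two sample variances weighted by their degrees of freedom
pooled_var = ((n1 - 1) * variance(laptop) + (n2 - 1) * variance(printer)) / (n1 + n2 - 2)

# t = difference of the means divided by its standard error
t_stat = (mean(laptop) - mean(printer)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
print(t_stat)
```

The printed value should agree with the statistic row for Purchase_cost above (about -1.8611). The sign is negative because laptop is passed first and has the lower mean.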

Generalization to more pairs

If you have more than just laptop/printer as products and want to compare all pairs, you could generalize with:

from itertools import combinations

cols = df.columns.difference(['Product'])

g = df.groupby('Product')[cols]

out = pd.concat({(a,b): pd.DataFrame(ttest_ind(g.get_group(a), g.get_group(b)),
                                     columns=cols, index=['statistic', 'pvalue'])
                 for a, b in combinations(df['Product'].unique(), 2)
                }, names=['product1', 'product2'])

Example output with an extra category (phone):

                             Purchase_cost  Warranty_years  service_cost
product1 product2                                                       
laptop   printer  statistic      -1.861113        3.513240     -0.919464
                  pvalue          0.099760        0.007924      0.384738
         phone    statistic      -1.945836        2.988072      2.766417
                  pvalue          0.109251        0.030515      0.039533
printer  phone    statistic      -1.286968        0.423659      1.893370
                  pvalue          0.239026        0.684528      0.100178

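If you have many combinations, note that you should likely adjust the p-values to account for multiple testing (for example with a Bonferroni or Holm correction), since the chance of at least one false positive grows with the number of tests.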
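
One possible adjustment is Holm's step-down correction. The sketch below is plain Python so that the logic is visible; holm_adjust is a hypothetical helper name, not a SciPy or pandas function. In practice you may prefer a library routine such as multipletests from statsmodels. The input is the nine p-values from the example output above, read row by row.

```python
def holm_adjust(pvalues):
    """Return Holm-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    # Indices of the p-values, from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by (m - k + 1)
        candidate = min(1.0, (m - rank) * pvalues[i])
        # Adjusted values must not decrease as the raw p-values increase
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted

# p-values from the three-product example (3 pairs x 3 columns)
raw = [0.099760, 0.007924, 0.384738,
       0.109251, 0.030515, 0.039533,
       0.239026, 0.684528, 0.100178]
adjusted = holm_adjust(raw)
print([round(p, 4) for p in adjusted])
```

Whether to correct across all nine tests together, or separately per column, depends on which family of hypotheses you want to control. The sketch treats them as one family.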