I want to write a short piece of code (a few lines) that runs a t-test on Product against Purchase_cost, Warranty_years and service_cost at the same time.
# dataset
import pandas as pd
from scipy.stats import ttest_ind
data = {'Product': ['laptop', 'printer','printer','printer','laptop','printer','laptop','laptop','printer','printer'],
'Purchase_cost': [120.09, 150.45, 300.12, 450.11, 200.55,175.89,124.12,113.12,143.33,375.65],
'Warranty_years':[3,2,2,1,4,1,2,3,1,2],
'service_cost': [5,5,10,4,7,10,4,6,12,3]
}
df = pd.DataFrame(data)
print(df)
Here is my code attempt for Product & Purchase_cost. I also want to run the t-test for Product & Warranty_years and for Product & service_cost.
#define samples
group1 = df[df['Product']=='laptop']
group2 = df[df['Product']=='printer']
#perform independent two sample t-test
ttest_ind(group1['Purchase_cost'], group2['Purchase_cost'])
ttest_ind can work on 2D (ND) inputs, so you can pass all the numeric columns at once:
cols = df.columns.difference(['Product'])
# or with an explicit list
# cols = ['Purchase_cost', 'Warranty_years', 'service_cost']
group1 = df[df['Product']=='laptop']
group2 = df[df['Product']=='printer']
out = pd.DataFrame(ttest_ind(group1[cols], group2[cols]),
columns=cols, index=['statistic', 'pvalue'])
If it couldn't, you could use a dictionary comprehension looping over your columns:
out = pd.DataFrame({c: ttest_ind(group1[c], group2[c]) for c in cols},
index=['statistic', 'pvalue'])
Output:
           Purchase_cost  Warranty_years  service_cost
statistic      -1.861113        3.513240     -0.919464
pvalue          0.099760        0.007924      0.384738
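The reason the single 2D call works is that ttest_ind tests along axis=0 by default, so each column is treated as its own pair of samples. The following is only a minimal sketch to check that claim on the question's data; it converts the groups to NumPy arrays with to_numpy() (a choice made here, not something the answer requires) and compares the one-shot result with a column-by-column loop.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Same data as in the question
df = pd.DataFrame({
    'Product': ['laptop', 'printer', 'printer', 'printer', 'laptop',
                'printer', 'laptop', 'laptop', 'printer', 'printer'],
    'Purchase_cost': [120.09, 150.45, 300.12, 450.11, 200.55,
                      175.89, 124.12, 113.12, 143.33, 375.65],
    'Warranty_years': [3, 2, 2, 1, 4, 1, 2, 3, 1, 2],
    'service_cost': [5, 5, 10, 4, 7, 10, 4, 6, 12, 3],
})

cols = ['Purchase_cost', 'Warranty_years', 'service_cost']
group1 = df.loc[df['Product'] == 'laptop', cols].to_numpy()
group2 = df.loc[df['Product'] == 'printer', cols].to_numpy()

# One call on the 2D arrays: axis=0 (the default) tests each column separately
res_2d = ttest_ind(group1, group2, axis=0)

# The same tests, run one column at a time
per_column = [ttest_ind(group1[:, j], group2[:, j]) for j in range(len(cols))]
stats_1d = [r.statistic for r in per_column]
pvals_1d = [r.pvalue for r in per_column]

# True if both approaches agree on every statistic and p-value
same = bool(np.allclose(res_2d.statistic, stats_1d)
            and np.allclose(res_2d.pvalue, pvals_1d))
print(same)
```

If the two approaches agree, the 2D form is simply the shorter way to write the loop.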
If you have more than just laptop/printer as products and want to compare all pairs, you could generalize with:
from itertools import combinations
cols = df.columns.difference(['Product'])
g = df.groupby('Product')[cols]
out = pd.concat({(a, b): pd.DataFrame(ttest_ind(g.get_group(a), g.get_group(b)),
                                      columns=cols, index=['statistic', 'pvalue'])
                 for a, b in combinations(df['Product'].unique(), 2)
                 }, names=['product1', 'product2'])
Example output with an extra category (phone):
                             Purchase_cost  Warranty_years  service_cost
product1 product2
laptop   printer  statistic      -1.861113        3.513240     -0.919464
                  pvalue          0.099760        0.007924      0.384738
         phone    statistic      -1.945836        2.988072      2.766417
                  pvalue          0.109251        0.030515      0.039533
printer  phone    statistic      -1.286968        0.423659      1.893370
                  pvalue          0.239026        0.684528      0.100178
If you have many combinations, note that you should likely adjust the p-values to account for multiple testing.