python pandas scipy multi-index t-test

Use columns from one DataFrame as MultiIndex for t-test in another


What is the best practice for using the columns of one DataFrame as index keys into another, MultiIndexed DataFrame in pandas in order to run a t-test?

I've seen a couple of similar questions on here whose answers involve looping, which doesn't seem ideal.

For example, I'd like to run a t-test comparing the rows of dat whose groups are listed in inds (below) against the rows of dat whose groups are not in inds.

import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

np.random.seed(999)
dat = pd.DataFrame(data={"Group1" : np.random.randint(1, 3, 100),
                         "Group2" : np.random.randint(1, 5, 100),
                         "Value" : np.random.normal(size=100)})
dat.set_index(["Group1", "Group2"], inplace=True)

# How to use this as indices into MultiIndex of dat for t-test?
inds = pd.DataFrame(data={"Group1" : np.random.randint(1, 4, 20),
                          "Group2" : np.random.randint(2, 6, 20)})

# My attempt using joins, seems quite inefficient
inds["ind"] = True
inds.set_index(["Group1", "Group2"], inplace=True)

df = pd.merge(dat, inds, how='outer', left_index=True, right_index=True)
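# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, and cast to bool so that ~df['ind'] is a proper boolean mask
df['ind'] = df['ind'].fillna(False).astype(bool)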

# run test
tst = ttest_ind(df.loc[df['ind'], 'Value'],
                df.loc[~df['ind'], 'Value'], equal_var=False, nan_policy='omit')

Solution

  • How about looking up the index directly to get each subset for the t-test? This may be slightly more efficient.

    import numpy as np
    import pandas as pd
    from scipy.stats import ttest_ind
    
    np.random.seed(999)
    dat = pd.DataFrame(data={"Group1" : np.random.randint(1, 3, 100),
                             "Group2" : np.random.randint(1, 5, 100),
                             "Value" : np.random.normal(size=100)})
    dat.set_index(["Group1", "Group2"], inplace=True)
    
    # How to use this as indices into MultiIndex of dat for t-test?
    inds = pd.DataFrame(data={"Group1" : np.random.randint(1, 4, 20),
                              "Group2" : np.random.randint(2, 6, 20)})
    
    # Up to here the code is the same as yours (without inds["ind"] = True)
    inds.set_index(["Group1", "Group2"], inplace=True)
    
    # Only here is different (run test)
    tst = ttest_ind(dat.loc[dat.index.isin(inds.index), 'Value'],
                    dat.loc[~dat.index.isin(inds.index), 'Value'], equal_var=False, nan_policy='omit')
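
To make the index lookup easier to follow, here is a minimal sketch on a tiny hand-written frame. The names dat_small and inds_small and their values are made up for illustration and are not part of the question's data. It shows that isin on a MultiIndex returns one boolean per row, and that repeated or unmatched pairs in the selector do not add rows.

```python
import pandas as pd

# Hypothetical toy data (not the question's random frames)
dat_small = pd.DataFrame(
    {"Group1": [1, 1, 2, 2], "Group2": [1, 2, 1, 2], "Value": [0.1, 0.2, 0.3, 0.4]}
).set_index(["Group1", "Group2"])

# Selector with a repeated pair (1, 2) and a pair (3, 5) that is absent from dat_small
inds_small = pd.DataFrame({"Group1": [1, 1, 3], "Group2": [2, 2, 5]}).set_index(
    ["Group1", "Group2"]
)

# One boolean per row of dat_small: True where the (Group1, Group2) pair occurs in inds_small
mask = dat_small.index.isin(inds_small.index)

selected = dat_small.loc[mask, "Value"]   # rows whose group pair is listed
rest = dat_small.loc[~mask, "Value"]      # all remaining rows

print(mask)
print(len(selected) + len(rest) == len(dat_small))
```

Because the mask always has the same length as dat_small, the two subsets together contain every original row exactly once.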
    

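    As a side note, if I understand your intention correctly, you want to run the t-test on all 100 samples. In your original code the merge repeats a row of dat once for every repeated (Group1, Group2) pair in inds, so the merged frame can contain more than 100 rows with a Value. To get back to 100 samples, those duplicated rows need to be removed, for example with df.drop_duplicates().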
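
The duplication can be seen on a small example. This is only a sketch with made-up frames (dat_small, inds_small and inds_unique are hypothetical names). As an alternative to dropping duplicates afterwards, it de-duplicates the selector before a left merge; that is one possible way to do it, not necessarily the only one.

```python
import pandas as pd

# Hypothetical toy data: three rows, each with a distinct (Group1, Group2) pair
dat_small = pd.DataFrame(
    {"Group1": [1, 1, 2], "Group2": [1, 2, 1], "Value": [0.1, 0.2, 0.3]}
).set_index(["Group1", "Group2"])

# (1, 2) appears twice in the selector; (3, 5) does not occur in dat_small
inds_small = pd.DataFrame(
    {"Group1": [1, 1, 3], "Group2": [2, 2, 5], "ind": True}
).set_index(["Group1", "Group2"])

# Outer merge as in the question: the (1, 2) row is repeated and a
# (3, 5) row with a missing Value is added
merged = pd.merge(dat_small, inds_small, how="outer", left_index=True, right_index=True)

# De-duplicate the selector first, then keep only the rows of dat_small
inds_unique = inds_small[~inds_small.index.duplicated()]
fixed = pd.merge(dat_small, inds_unique, how="left", left_index=True, right_index=True)

print(len(dat_small), len(merged), len(fixed))
```

With the selector de-duplicated and a left merge, the result has exactly one row per row of dat_small, so no sample is counted twice in the test.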

    Hope this helps.