python  pandas  t-test

Two-sided t-test in pandas


If I have a df like this:

        normalized_0  normalized_1  normalized_0.1  mean      std
Site                           
0           NaN      0.798262      1.456576       0.888687  0.118194
1      0.705540      0.885226           NaN       0.761488  0.047023
2      0.669539      1.002526      1.212976       0.826657  0.077940
3      0.829826      0.968180      0.988679       0.871290  0.032367

How do I calculate a two-sided t-test for sites 0, 1, 2 vs. site 3?

I tried it with:

from scipy.stats import ttest_ind

df['ttest'] = ttest_ind(df, df.loc[3])

But this does not work... The error I get is:

TypeError: unsupported operand type(s) for /: 'str' and 'int'

How would you solve this?


Solution

  • My answer might be completely off, as I've only read about t-tests :)

    What I understood from your question is that you have a table with both normalized values and their descriptive statistics (mean, std).

    Each index value within this table is a category of your analysis, and you want to compare categories [0, 1, 2] vs [3].

    I also assume you only need normalized values as input arrays, without mean or std.


    import numpy as np
    from scipy.stats import ttest_ind

    # keep only the normalized columns; mean and std are not test inputs
    selected_data = df[['normalized_0', 'normalized_1', 'normalized_0.1']].copy()
    # compare Site 3 (a) against every row (b), ignoring NaN values
    selected_data['ttest'] = [ttest_ind(a=selected_data.iloc[3, :].values,
                                        b=selected_data.iloc[x, :].values,
                                        nan_policy='omit') for x in np.arange(len(selected_data))]
    
    df.join(selected_data['ttest'])
    
            normalized_0  normalized_1  normalized_0.1 mean      std       ttest 
    Site                           
    0           NaN      0.798262      1.456576       0.888687  0.118194  (-0.7826642930343911, 0.4909212050511221)
    1      0.705540      0.885226           NaN       0.761488  0.047023  (1.4370158341444121, 0.24625840339538163)  
    2      0.669539      1.002526      1.212976       0.826657  0.077940  (-0.19764518466194855, 0.8529602343240825)
    3      0.829826      0.968180      0.988679       0.871290  0.032367  (0.0, 1.0)
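
If you would rather have the statistic and the p-value as separate numeric columns instead of one column of tuples, here is a minimal self-contained sketch. The DataFrame is rebuilt from the values in the question; the helper name compare_to_reference and the column names t_stat and p_value are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Normalized values copied from the question's table (index = Site).
data = pd.DataFrame(
    {
        "normalized_0": [np.nan, 0.705540, 0.669539, 0.829826],
        "normalized_1": [0.798262, 0.885226, 1.002526, 0.968180],
        "normalized_0.1": [1.456576, np.nan, 1.212976, 0.988679],
    },
    index=pd.Index([0, 1, 2, 3], name="Site"),
)

# Site 3 is the reference group every other site is compared against.
reference = data.loc[3].to_numpy()


def compare_to_reference(row):
    """Two-sided independent t-test of the reference site against one row."""
    result = ttest_ind(reference, row.to_numpy(), nan_policy="omit")
    return pd.Series(
        {"t_stat": float(result.statistic), "p_value": float(result.pvalue)}
    )


# apply(axis=1) calls the helper once per site and stacks the returned Series.
ttests = data.apply(compare_to_reference, axis=1)
print(data.join(ttests))
```

The numbers should be the same as in the tuple column above; only the layout changes, which makes it easier to filter or sort by p-value afterwards.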
    

    The a and b parameters are the row values of the selected columns:

    # values for Site 3 (the row at position 3), for example
    selected_data.iloc[3, :].values 
    # array([0.829826, 0.96818 , 0.988679])
    

    nan_policy='omit' ignores NaN values when calculating the test. By default nan_policy is set to 'propagate', which returns nan if any missing values are present.
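
To see the difference between the two nan_policy settings, here is a small sketch using Site 0 (which contains a NaN) and Site 3 from the question. The variable names are only for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# Rows taken from the question's table: Site 0 has a NaN, Site 3 is complete.
site_0 = np.array([np.nan, 0.798262, 1.456576])
site_3 = np.array([0.829826, 0.968180, 0.988679])

# Default nan_policy='propagate': a single NaN makes the whole result NaN.
propagated = ttest_ind(site_3, site_0)

# nan_policy='omit': NaN values are dropped before the test is computed.
omitted = ttest_ind(site_3, site_0, nan_policy="omit")

# Same thing by hand: remove the NaNs yourself, then run the default test.
manual = ttest_ind(site_3, site_0[~np.isnan(site_0)])

print(propagated.pvalue)              # nan
print(omitted.statistic, omitted.pvalue)
print(manual.statistic, manual.pvalue)
```

The last two lines should print the same pair of numbers, matching the Site 0 row in the table above.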