Tags: python, python-3.x, statsmodels, posthoc, tukey

Correction for Multiple Comparison of Means - Tukey HSD in Python


I have a dataset with 4 conditions (A, B, C, D). Running a one-way ANOVA, I observed a linear increase of my dependent variable (reaction time, RT) across the 4 conditions.

I would like to run a Tukey HSD post-hoc test to see whether the increases in RT from A to B, from B to C, and from C to D are significant.

To run the test in Python, I am using the following code:

# Multiple Comparison of Means - Tukey HSD
from statsmodels.stats.multicomp import pairwise_tukeyhsd
print(pairwise_tukeyhsd(df["RT"], df['Cond']))

The problem I am facing is that this assumes I am interested in all possible comparisons (A vs B, A vs C, A vs D, B vs C, B vs D, C vs D), so the correction applied is based on 6 tests. However, I only have hypotheses about 3 comparisons (A vs B, B vs C, C vs D).

How can I inform the post-hoc test about the number/type of comparisons I am interested in?


Solution

  • Unfortunately, you cannot. Tukey HSD is not a pairwise t-test with a multiple-comparison adjustment applied to raw p-values; the p-values it reports come from the studentized range (q) distribution (a sketch of that calculation follows the Tukey output below).

    One way around this is to fit a linear model (which is equivalent to your ANOVA), run pairwise t-tests on the coefficients, and subset to the comparisons you need.

    To illustrate this, I use some simulated data; this is what Tukey HSD looks like:

    import pandas as pd
    import numpy as np
    from statsmodels.formula.api import ols
    from statsmodels.stats.multicomp import pairwise_tukeyhsd
    from statsmodels.stats.multitest import multipletests
    
    np.random.seed(123)
    
    df = pd.DataFrame({'RT':np.random.randn(100),'Cond':np.random.choice(['A','B','C','D'],100)})
    
    hs_res=pairwise_tukeyhsd(df["RT"], df['Cond'])
    print(hs_res)
    
    Multiple Comparison of Means - Tukey HSD, FWER=0.05
    ===================================================
    group1 group2 meandiff p-adj   lower  upper  reject
    ---------------------------------------------------
         A      B  -0.6598 0.2428 -1.5767 0.2571  False
         A      C  -0.3832 0.6946 -1.3334  0.567  False
         A      D   -0.634 0.2663 -1.5402 0.2723  False
         B      C   0.2766 0.7861 -0.5358 1.0891  False
         B      D   0.0258    0.9 -0.7347 0.7864  False
         C      D  -0.2508 0.8257 -1.0513 0.5497  False
    ---------------------------------------------------
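
    As an aside, here is a minimal sketch of where the p-adj values above come from: they are tail probabilities of the studentized range (q) distribution, not corrected t-test p-values. It reproduces the A vs B row, assuming scipy >= 1.7 (for studentized_range) and reusing df and np from above:

    from scipy.stats import studentized_range

    groups = df.groupby('Cond')['RT']
    k = groups.ngroups                      # number of conditions
    n = groups.size()                       # per-group sample sizes
    means = groups.mean()
    dof = len(df) - k                       # residual degrees of freedom
    # pooled within-group (error) variance, as in the one-way ANOVA
    mse = (groups.var(ddof=1) * (n - 1)).sum() / dof

    se = np.sqrt(mse / 2 * (1 / n['A'] + 1 / n['B']))   # Tukey-Kramer standard error
    q = abs(means['A'] - means['B']) / se                # studentized range statistic
    print(studentized_range.sf(q, k, dof))               # should closely match the 0.2428 above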
    

    Now we fit the OLS model, and you can see the results are quite comparable (a quick check of why follows the table):

    res = ols("RT ~ Cond", df).fit()
    pw = res.t_test_pairwise("Cond", method="sh")  # pairwise t-tests on the coefficients, Simes-Hochberg ("sh") correction
    pw.result_frame
    
              coef   std err          t     P>|t|  Conf. Int. Low  Conf. Int. Upp.  pvalue-sh  reject-sh
    B-A  -0.659798  0.350649  -1.881645  0.062914       -1.355831         0.036236   0.352497      False
    C-A  -0.383176  0.363404  -1.054407  0.294343       -1.104528         0.338176   0.829463      False
    D-A  -0.633950  0.346604  -1.829032  0.070499       -1.321954         0.054054   0.352497      False
    C-B   0.276622  0.310713   0.890281  0.375541       -0.340138         0.893382   0.829463      False
    D-B   0.025847  0.290885   0.088858  0.929380       -0.551555         0.603250   0.929380      False
    D-C  -0.250774  0.306140  -0.819147  0.414731       -0.858458         0.356910   0.829463      False
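
    The two tables line up because each coefficient in result_frame is simply the difference between two group means, which is exactly the meandiff column of the Tukey output; what differs is how the multiplicity adjustment is computed. A quick check, reusing df from above:

    means = df.groupby('Cond')['RT'].mean()
    print(means['B'] - means['A'])   # -0.659798: the B-A coef here, and the A/B meandiff in the Tukey table
    print(means['C'] - means['B'])   #  0.276622: the C-B coef / B-C meandiff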
    

    Then we choose the subset and the correction method; below I use Simes-Hochberg ('sh') as above (a short sketch of what it does follows the output):

    subdf = pw.result_frame.loc[['B-A','C-B','D-C']].copy()   # keep only the three planned comparisons
    subdf['adj_p'] = multipletests(subdf['P>|t|'].values, method='sh')[1]   # re-correct over 3 tests instead of 6
    subdf
    
              coef   std err          t     P>|t|  Conf. Int. Low  Conf. Int. Upp.  pvalue-sh  reject-sh     adj_p
    B-A  -0.659798  0.350649  -1.881645  0.062914       -1.355831         0.036236   0.352497      False  0.188742
    C-B   0.276622  0.310713   0.890281  0.375541       -0.340138         0.893382   0.829463      False  0.414731
    D-C  -0.250774  0.306140  -0.819147  0.414731       -0.858458         0.356910   0.829463      False  0.414731
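
    To see what the Simes-Hochberg step-up adjustment is doing with those three raw p-values, the arithmetic is short (reusing np and multipletests from above):

    raw = np.array([0.062914, 0.375541, 0.414731])   # P>|t| for B-A, C-B, D-C
    print(multipletests(raw, method='sh')[1])        # [0.188742, 0.414731, 0.414731]
    # step-up by hand: the largest p stays 0.414731; the next is min(2*0.375541, 0.414731) = 0.414731;
    # the smallest is min(3*0.062914, 0.414731) = 0.188742 -- exactly the adj_p column above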
    

    As a comment: if you see a trend, there may be other models that capture it directly instead of relying on a post-hoc test (one option is sketched below). Also, subsetting to the tests you need and then applying a correction can be argued to be a form of cherry-picking. If the number of comparisons is small (like the 6 in your example), I suggest you go with Tukey. This is another discussion you could post on Cross Validated.
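
    For completeness, here is a minimal sketch of the "model the trend" idea, reusing df and ols from above: treat the ordered conditions as a numeric predictor and test the linear increase directly with a single slope (the 0-3 coding and the Cond_num column name are just illustrative assumptions).

    df['Cond_num'] = df['Cond'].map({'A': 0, 'B': 1, 'C': 2, 'D': 3})
    trend = ols("RT ~ Cond_num", df).fit()
    print(trend.summary().tables[1])   # one slope, one p-value, no post-hoc correction needed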