Tags: python, nan, tukey

NaN appearing in results for a Tukey's HSD in Python


I am trying to perform Tukey's HSD test to see whether the means of a variable differ significantly across several groups in my data. For example, here I am testing whether the mean of 'acad_se_communicate_needs' differs by 'Class'. However, my results contain NaN values. What is going on here, and how might I fix it?

I have used statsmodels functions to do this. I have avoided methods that require splitting the data into a separate dataframe for each group, because I have to perform this analysis for several variables. Those methods are also really difficult for me to understand.

from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multicomp import MultiComparison

mc = MultiComparison(clean['acad_se_communicate_needs'], clean['Class'])
result = mc.tukeyhsd()
print(result)

My output is as follows, with NaN everywhere:

Multiple Comparison of Means - Tukey HSD,FWER=0.05
==============================================
 group1    group2  meandiff lower upper reject
----------------------------------------------
Freshman   Junior    nan     nan   nan  False 
Freshman   Senior    nan     nan   nan  False 
Freshman Sophomore   nan     nan   nan  False 
 Junior    Senior    nan     nan   nan  False 
 Junior  Sophomore   nan     nan   nan  False 
 Senior  Sophomore   nan     nan   nan  False 
----------------------------------------------

My data contain missing (NaN) values, so I tried some code to remove them. That code looks like:
sm.stats.multicomp.pairwise_tukeyhsd('acad_se_communicate_needs','Class', alpha=0.05, missing = 'drop')

However, I get an error that says "pairwise_tukeyhsd() got an unexpected keyword argument 'missing'".
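
The error is consistent with the signature of pairwise_tukeyhsd, which takes only endog, groups and alpha. It has no option for dropping missing values, and it expects the data themselves rather than column names as strings. A minimal sketch of filtering the rows before the call is shown here; the small dataframe is made up to stand in for 'clean', and its values are purely illustrative.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Made-up stand-in for the `clean` dataframe (assumption: the real data
# has a numeric score column and a categorical 'Class' column).
clean = pd.DataFrame({
    "acad_se_communicate_needs": [3.0, 4.0, np.nan, 2.0, 5.0, 4.0,
                                  np.nan, 3.0, 1.0, 2.0, 4.0, 5.0],
    "Class": ["Freshman"] * 3 + ["Sophomore"] * 3
             + ["Junior"] * 3 + ["Senior"] * 3,
})

# Count the missing values in each of the two columns.
n_missing = clean[["acad_se_communicate_needs", "Class"]].isna().sum()
print(n_missing)

# Keep only rows where both the score and the group label are present.
mask = clean["acad_se_communicate_needs"].notna() & clean["Class"].notna()

# Pass the filtered columns themselves, not their names.
result = pairwise_tukeyhsd(
    endog=clean.loc[mask, "acad_se_communicate_needs"],
    groups=clean.loc[mask, "Class"],
    alpha=0.05,
)
print(result)
```

With four groups the table should have six pairwise rows, and none of the mean differences should be NaN once the incomplete rows are excluded.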


Solution

  • I ended up creating a new dataframe containing only the two columns of interest, dropping the rows with missing values, and then performing Tukey's HSD test on it.

        from statsmodels.stats.multicomp import MultiComparison

        # Keep only the two columns needed, then drop rows missing either value
        cleanTukey1 = clean.filter(items=['acad_se_communicate_needs', 'Class']).dropna()

        mc1 = MultiComparison(cleanTukey1['acad_se_communicate_needs'], cleanTukey1['Class'])
        result1 = mc1.tukeyhsd()
        print(result1)
        print(mc1.groupsunique)  # the group labels used in the comparison
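
Since the question mentions repeating the analysis for several variables, the same steps could be wrapped in a small helper and looped over column names. This is only a sketch: the function name tukey_by_group, the second column 'acad_se_ask_questions', and the data are all hypothetical and not taken from the original dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import MultiComparison


def tukey_by_group(df, value_col, group_col, alpha=0.05):
    """Run Tukey's HSD for value_col by group_col, skipping incomplete rows."""
    # Drop rows only where one of these two columns is missing, so each
    # variable keeps as many observations as it can.
    subset = df[[value_col, group_col]].dropna()
    mc = MultiComparison(subset[value_col], subset[group_col])
    return mc.tukeyhsd(alpha=alpha)


# Made-up stand-in for the `clean` dataframe; values are illustrative only.
clean = pd.DataFrame({
    "acad_se_communicate_needs": [3.0, 4.0, np.nan, 2.0, 5.0, 4.0,
                                  np.nan, 3.0, 1.0, 2.0, 4.0, 5.0],
    # Hypothetical second variable, not from the original post.
    "acad_se_ask_questions": [2.0, np.nan, 3.0, 4.0, 4.0, 5.0,
                              1.0, 2.0, 2.0, np.nan, 3.0, 5.0],
    "Class": ["Freshman"] * 3 + ["Sophomore"] * 3
             + ["Junior"] * 3 + ["Senior"] * 3,
})

variables = ["acad_se_communicate_needs", "acad_se_ask_questions"]

# One Tukey HSD result per variable, keyed by column name.
results = {col: tukey_by_group(clean, col, "Class") for col in variables}

for col, res in results.items():
    print(col)
    print(res)
```

Dropping missing rows per variable, rather than once for the whole dataframe, means a gap in one column does not discard valid observations for another.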