Correlation testing with missing data


I want to test whether the traits in my data set depend on sex. The data are ordinal: the sexes are coded as male (1) and female (2), and there are several traits (T1, T2, T3, ...) on different ordinal scales (some ranging from 0-2, others from 0-5, or in words from "not present" to "strongly expressed"). Additionally, there are quite a few missing entries (NA) in the ordinal trait data.

sex T1
1 0
2 2
1 NA
2 1
2 0

To test for sex dependency, I want to use Kendall's tau coefficient. For this, I used cor() and cor.test() with method = "kendall". However, I am not sure whether I did it correctly, and the output of cor() makes me doubt the result:

cor(data$sex, data$T1, method="kendall")
[1] NA
cor.test(data$sex, data$T1, method="kendall")

    Kendall's rank correlation tau

data:  data$sex and data$T1
z = 0.052821, p-value = 0.9579
alternative hypothesis: true tau is not equal to 0
sample estimates:
      tau 
0.0120125 

What does the NA mean? Is the result still reliable, or did I make a mistake? Are there any other suggestions for testing sex dependency in ordinal traits? Normally, in such a study design, the ordinal data would be dichotomized (0 and 1) and Fisher's Exact Test would be used. However, dichotomizing is not my aim, and I need to retain the ordinal scale.


Solution

  • As mentioned in the other comments/answers, base R's cor() defaults to use = "everything", so a single NA in either vector makes the whole result NA. cor.test(), by contrast, drops incomplete pairs before computing the statistic, which is why it still returned a tau estimate. There are a couple of ways to make cor() handle the missing values, shown below. First, I recreated your data:

    #### Recreate Data ####
    sex <- c(1,2,1,2,2)
    t1 <- c(0,2,NA,1,0)
    df <- data.frame(sex,t1)
    df
    

    Then, by setting use = "complete.obs", you can get the Kendall correlation computed only on the pairs without NA values:

    #### Base R Method ####
    cor(sex,
        t1,
        use = "complete.obs",
        method = "kendall")
    

    Shown below:

    [1] 0.5163978
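
    As a minimal sketch of why cor() and cor.test() disagree in the question, the snippet below uses the same toy data and checks that cor.test() effectively works on the complete pairs only. The helper names keep, n_used, tau_test and tau_cor are just illustrative, and the warning about ties is suppressed purely to keep the output quiet:

    ```r
    # Same toy data as above
    sex <- c(1, 2, 1, 2, 2)
    t1  <- c(0, 2, NA, 1, 0)

    # cor() defaults to use = "everything": a single NA makes the result NA
    r_default <- cor(sex, t1, method = "kendall")

    # Pairs in which both values are present
    keep   <- complete.cases(sex, t1)
    n_used <- sum(keep)

    # cor.test() drops incomplete pairs itself; with tied values it cannot
    # compute an exact p-value and warns, hence suppressWarnings()
    res      <- suppressWarnings(cor.test(sex, t1, method = "kendall"))
    tau_test <- unname(res$estimate)

    # Same estimate as cor() restricted to the complete pairs
    tau_cor <- cor(sex[keep], t1[keep], method = "kendall")

    r_default                  # NA
    n_used                     # number of pairs actually used
    all.equal(tau_test, tau_cor)
    ```

    Counting the complete pairs this way also gives you the number of observations behind the estimate, which cor() itself does not report.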
    

    Additionally, you can use the correlation() function from the correlation package, which automatically drops incomplete pairs:

    #### Correlation Package ####
    correlation::correlation(df, method = "kendall")
    

    Shown below:

    # Correlation Matrix (kendall-method)
    
    Parameter1 | Parameter2 |  tau |        95% CI |    z |     p
    -------------------------------------------------------------
    sex        |         t1 | 0.52 | [-1.00, 1.00] | 0.94 | 0.346
    
    p-value adjustment method: Holm (1979)
    Observations: 4
    

    The advantages of this function are: 1) you can use a dplyr workflow to select, filter, etc. and then apply the function; 2) it returns a self-contained table with your CIs, test statistics, p-values, etc.; 3) it reports how many observations were used, which the base R functions do not.
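
    If you prefer to stay in base R, the same kind of table can be built by hand for several traits. This is only a sketch under assumptions: the data frame dat and its values are made up for illustration, the T2 column is hypothetical, and the Holm adjustment is used simply because it is the one shown in the correlation output above.

    ```r
    # Hypothetical toy data: values are made up for illustration only
    dat <- data.frame(
      sex = c(1, 2, 1, 2, 2, 1, 2, 1),
      T1  = c(0, 2, NA, 1, 0, 1, 2, 0),
      T2  = c(3, NA, 5, 0, 1, NA, 2, 4)
    )

    # Every column except sex is treated as an ordinal trait
    traits <- setdiff(names(dat), "sex")

    results <- do.call(rbind, lapply(traits, function(tr) {
      # Pairs where both sex and the trait are observed
      keep <- complete.cases(dat$sex, dat[[tr]])
      # Ties prevent an exact p-value, so cor.test() warns; silence that here
      res <- suppressWarnings(cor.test(dat$sex, dat[[tr]], method = "kendall"))
      data.frame(trait = tr,
                 n     = sum(keep),
                 tau   = unname(res$estimate),
                 p     = res$p.value)
    }))

    # Several traits are tested, so adjust the p-values for multiple testing
    results$p_adj <- p.adjust(results$p, method = "holm")
    results
    ```

    Each row then carries the number of complete pairs next to its tau and p-value, so traits with many missing entries are easy to spot.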