
Different results from cor.test() and sm_statCorr() for Spearman's rho?


I'm comparing two surveys that each score two migraine symptoms. Because I want to present the correlation for each symptom in both a table and a plot, I used cor.test() to retrieve the values for the table, and ggplot() with sm_statCorr() to create the plot. However, the two functions give slightly different values for Spearman's rho. I suspect it has to do with my data transformation, but I haven't been able to figure out where.

Here is a reproducible example:

install.packages("pacman")
pacman::p_load(stats, dplyr, ggplot2, smplot2) 

# Import and transform data
data <- matrix(c(1, "headache", 1, NA, 1, "aura", 0, 0, 2, "headache", 1, 1, 2, "aura", 0, 0, 
                 3, "headache", -1, -1, 3, "aura", 0, 1, 4, "headache", 1, 2, 4, "aura", -2, -2, 
                 5, "headache", 0, 1, 5, "aura", 1, 1, 6, "headache", 2, 2, 6, "aura", 0, 0, 
                 7, "headache", 0, 0, 7, "aura", 0, 0, 8, "headache", 1, 1, 8, "aura", 0, 0, 
                 9, "headache", 1, 0, 9, "aura", 0, 0, 10, "headache", 1, -1, 10, "aura", 0, 0, 
                 11, "headache", 0, 1, 11, "aura", 0, 0, 12, "headache", 0, 0, 12, "aura", 0, 0), 
               nrow = 24, ncol = 4, byrow = TRUE)
colnames(data) <- c("id", "symptom", "survey_1", "survey_2")
# The matrix is character (it mixes numbers and labels), so convert the numeric columns back
data <- as.data.frame(data)
data$id <- as.numeric(data$id)
data$survey_1 <- as.numeric(data$survey_1)
data$survey_2 <- as.numeric(data$survey_2)
data$symptom <- factor(data$symptom, levels = c("headache", "aura"))

# Create plot with sm_statCorr
data_n <- data %>% group_by(symptom, survey_1, survey_2) %>% tally() %>% ungroup()
data_n <- data_n[,c("symptom", "survey_1", "survey_2", "n")]

ggplot(data_n, aes(x = survey_1, y = survey_2)) + 
  geom_point(aes(size = n)) +
  sm_statCorr(data = data, aes(x = survey_1, y = survey_2), corr_method = "spearman") +
  facet_wrap(. ~ symptom)

# Perform cor.test
cor.test(data[(data$symptom == "headache"), ]$survey_1, 
         data[(data$symptom == "headache"), ]$survey_2,
         method = "spearman")
cor.test(data[(data$symptom == "aura"), ]$survey_1, 
         data[(data$symptom == "aura"), ]$survey_2,
         method = "spearman")

cor.test() gives a Spearman's rho of 0.5028556 for headache and 0.8174239 for aura. In the plot, however, the values are shown as 0.55 and 0.83, respectively.


Solution

  • The two approaches give different statistics because they are computed from different data. In the plot, sm_statCorr computes the correlation per symptom from the data_n data frame, which holds one row per distinct combination of symptom, survey_1 and survey_2, whereas cor.test uses every row of data.

    This happens because the facet call looks for the faceting variable in the data supplied to the top-level ggplot call, which in the question was data_n.

    You can get the same groupings by swapping the data that is used at the top level:

    ggplot(data = data, aes(x = survey_1, y = survey_2)) +
       geom_point(data = data_n, aes(x = survey_1, y = survey_2, size = n)) +
       sm_statCorr(corr_method = "spearman") +
       facet_wrap(. ~ symptom) # this now uses `symptom` from `data`
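
As a quick check of this explanation, the sketch below recomputes Spearman's rho in base R twice: once from every response, and once from only the distinct (symptom, survey_1, survey_2) combinations, which are the rows that the tally() step leaves in data_n. The names survey_data, distinct_data and spearman_rho are illustrative and not part of the question; the values are re-entered from the question's matrix so that the snippet runs on its own.

```r
# Re-entry of the question's responses (assumed identical to `data` above),
# in the original row order: headache, aura, headache, aura, ...
survey_data <- data.frame(
  symptom  = rep(c("headache", "aura"), times = 12),
  survey_1 = c(1, 0, 1, 0, -1, 0, 1, -2, 0, 1, 2, 0,
               0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0),
  survey_2 = c(NA, 0, 1, 0, -1, 1, 2, -2, 1, 1, 2, 0,
               0, 0, 1, 0, 0, 0, -1, 0, 1, 0, 0, 0)
)

# One row per distinct combination, i.e. the rows of data_n without the n column
distinct_data <- unique(survey_data)

# Hypothetical helper: Spearman's rho on complete pairs only,
# mirroring how cor.test() drops pairs that contain an NA
spearman_rho <- function(d) {
  keep <- complete.cases(d$survey_1, d$survey_2)
  cor(d$survey_1[keep], d$survey_2[keep], method = "spearman")
}

# Compute rho per symptom for both versions of the data
rho_full     <- sapply(split(survey_data, survey_data$symptom), spearman_rho)
rho_distinct <- sapply(split(distinct_data, distinct_data$symptom), spearman_rho)

print(round(rbind(full = rho_full, distinct = rho_distinct), 4))
```

The full row reproduces the cor.test() values from the question (about 0.50 for headache and 0.82 for aura), while the distinct row rounds to the 0.55 and 0.83 shown in the plot. This is consistent with the plot's statistics having been computed from data_n rather than data.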