I'm comparing two different surveys for two different symptoms of migraine. Because I want to present the correlation for each symptom in both a table and a plot, I used cor.test() to get the values for the table, and ggplot() with sm_statCorr() to create the plot. However, the two approaches give slightly different values for Spearman's rho. I suspect it has to do with my data transformation, but I haven't been able to figure out where.
Here is a reproducible example:
pacman::p_load(stats, dplyr, ggplot2, smplot2)
# Import and transform data
data <- matrix(c(1, "headache", 1, NA, 1, "aura", 0, 0, 2, "headache", 1, 1, 2, "aura", 0, 0,
3, "headache", -1, -1, 3, "aura", 0, 1, 4, "headache", 1, 2, 4, "aura", -2, -2,
5, "headache", 0, 1, 5, "aura", 1, 1, 6, "headache", 2, 2, 6, "aura", 0, 0,
7, "headache", 0, 0, 7, "aura", 0, 0, 8, "headache", 1, 1, 8, "aura", 0, 0,
9, "headache", 1, 0, 9, "aura", 0, 0, 10, "headache", 1, -1, 10, "aura", 0, 0,
11, "headache", 0, 1, 11, "aura", 0, 0, 12, "headache", 0, 0, 12, "aura", 0, 0),
nrow = 24, ncol = 4, byrow = TRUE)
colnames(data) = c("id", "symptom", "survey_1", "survey_2")
data <- as.matrix(data)
data <- as.data.frame(data)
data$id <- as.numeric(data$id)
data$survey_1 <- as.numeric(data$survey_1)
data$survey_2 <- as.numeric(data$survey_2)
data$symptom = factor(data$symptom, levels=c("headache", "aura"))
# Create plot with sm_statCorr
data_n <- data %>% group_by(symptom, survey_1, survey_2) %>% tally() %>% ungroup()
data_n <- data_n[,c("symptom", "survey_1", "survey_2", "n")]
ggplot(data_n, aes(x = survey_1, y = survey_2)) +
geom_point(aes(size = n)) +
sm_statCorr(data = data, aes(x = survey_1, y = survey_2), corr_method = "spearman") +
facet_wrap(. ~ symptom)
# Perform cor.test
cor.test(data[(data$symptom == "headache"), ]$survey_1,
data[(data$symptom == "headache"), ]$survey_2,
method = "spearman")
cor.test(data[(data$symptom == "aura"), ]$survey_1,
data[(data$symptom == "aura"), ]$survey_2,
method = "spearman")
The cor.test() gives me Spearman's rho 0.5028556 for headache and 0.8174239 for aura.
In the plot, however, the corresponding values are given as 0.55 and 0.83, respectively.
The difference in statistics between the two approaches is due to them being calculated from different data. In the plot, the statistics produced by sm_statCorr() are grouped by the symptom column from the data_n data frame, whereas cor.test() uses data. This happens because the facet_wrap() call looks for the faceting variable in the data supplied to the top-level ggplot() call, which in the question was data_n.
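As a rough check of this explanation, the sketch below recomputes Spearman's rho for the headache rows of the question's example twice, using base R only: once on all participants and once on the distinct (survey_1, survey_2) pairs. The helper name spearman_rho is made up for illustration, and unique() is used here as a stand-in for the group_by() + tally() step, on the assumption that only the distinct pairs in data_n enter the calculation.

```r
# Headache rows from the question's example (id 1 has a missing survey_2 value).
headache <- data.frame(
  survey_1 = c(1, 1, -1, 1, 0, 2, 0, 1, 1, 1, 0, 0),
  survey_2 = c(NA, 1, -1, 2, 1, 2, 0, 1, 0, -1, 1, 0)
)

# Hypothetical helper: Spearman's rho on the complete pairs of a data frame.
spearman_rho <- function(df) {
  ok <- complete.cases(df)
  cor(df$survey_1[ok], df$survey_2[ok], method = "spearman")
}

# One row per participant, as passed to cor.test() in the question.
rho_full <- spearman_rho(headache)

# One row per distinct (survey_1, survey_2) pair, mirroring the rows of data_n.
rho_tallied <- spearman_rho(unique(headache))

round(c(full = rho_full, tallied = rho_tallied), 2)
```

The first value rounds to 0.50, in line with the cor.test() result, and the second to 0.55, in line with the label in the headache panel, which is consistent with the plot statistic being computed from the tallied data.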
You can get the same groupings by swapping the data that is used at the top level:
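ggplot(data = data, aes(x = survey_1, y = survey_2)) +
  # the points still come from the tallied data so that size = n works
  geom_point(data = data_n, aes(x = survey_1, y = survey_2, size = n)) +
  # no data argument here, so the layer inherits `data` from ggplot()
  sm_statCorr(corr_method = "spearman") +
  facet_wrap(. ~ symptom) # this now uses `symptom` from `data`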