I'm comparing two different surveys for two different symptoms of migraine. Because I want to present the correlation for each symptom in both a table and a plot, I used cor.test() to retrieve the values for the table, and ggplot() and sm_statCorr() to create the plot. However, the two approaches give slightly different values for Spearman's rho. I suspect that it has to do with my data transformation, but I haven't been able to figure out where.
Here is a reproducible example:
install.packages("pacman")
pacman::p_load(stats, dplyr, ggplot2, smplot2)
# Import and transform data
data <- matrix(c(1, "headache", 1, NA, 1, "aura", 0, 0, 2, "headache", 1, 1, 2, "aura", 0, 0,
3, "headache", -1, -1, 3, "aura", 0, 1, 4, "headache", 1, 2, 4, "aura", -2, -2,
5, "headache", 0, 1, 5, "aura", 1, 1, 6, "headache", 2, 2, 6, "aura", 0, 0,
7, "headache", 0, 0, 7, "aura", 0, 0, 8, "headache", 1, 1, 8, "aura", 0, 0,
9, "headache", 1, 0, 9, "aura", 0, 0, 10, "headache", 1, -1, 10, "aura", 0, 0,
11, "headache", 0, 1, 11, "aura", 0, 0, 12, "headache", 0, 0, 12, "aura", 0, 0),
nrow = 24, ncol = 4, byrow = TRUE)
colnames(data) <- c("id", "symptom", "survey_1", "survey_2")
data <- as.data.frame(data)
data$id <- as.numeric(data$id)
data$survey_1 <- as.numeric(data$survey_1)
data$survey_2 <- as.numeric(data$survey_2)
data$symptom <- factor(data$symptom, levels = c("headache", "aura"))
# Create plot with sm_statCorr
data_n <- data %>% group_by(symptom, survey_1, survey_2) %>% tally() %>% ungroup()
data_n <- data_n[,c("symptom", "survey_1", "survey_2", "n")]
ggplot(data_n, aes(x = survey_1, y = survey_2)) +
geom_point(aes(size = n)) +
sm_statCorr(data = data, aes(x = survey_1, y = survey_2), corr_method = "spearman") +
facet_wrap(. ~ symptom)
# Perform cor.test
cor.test(data[(data$symptom == "headache"), ]$survey_1,
data[(data$symptom == "headache"), ]$survey_2,
method = "spearman")
cor.test(data[(data$symptom == "aura"), ]$survey_1,
data[(data$symptom == "aura"), ]$survey_2,
method = "spearman")
cor.test() gives me a Spearman's rho of 0.5028556 for headache and 0.8174239 for aura.
In the plot, however, the corresponding values are shown as 0.55 and 0.83, respectively.
The two approaches give different statistics because they are calculated from different data. In the plot, the statistics produced by sm_statCorr are computed per symptom from the data_n dataframe, which holds one row per unique combination of values, whereas cor.test was run on the full data dataframe.
This happens because the facet call looks for the faceting variable in the data supplied to the top-level ggplot call, which in the question was data_n.
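As a quick check of this explanation, here is a minimal base-R sketch that computes Spearman's rho per symptom twice: once on the full data and once on the de-duplicated rows, which is what data_n contains apart from its n column. The data frame is rebuilt by hand from the values in the question, and data_unique and spearman_by_symptom are names made up for this illustration; they are not part of smplot2 or ggplot2.

```r
# Rebuild the question's data (rows alternate headache/aura for ids 1 to 12)
data <- data.frame(
  id = rep(1:12, each = 2),
  symptom = factor(rep(c("headache", "aura"), times = 12),
                   levels = c("headache", "aura")),
  survey_1 = c(1, 0, 1, 0, -1, 0, 1, -2, 0, 1, 2, 0,
               0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0),
  survey_2 = c(NA, 0, 1, 0, -1, 1, 2, -2, 1, 1, 2, 0,
               0, 0, 1, 0, 0, 0, -1, 0, 1, 0, 0, 0)
)

# One row per unique (symptom, survey_1, survey_2) combination,
# i.e. the same rows as data_n without the count column (hypothetical name)
data_unique <- unique(data[, c("symptom", "survey_1", "survey_2")])

# Hypothetical helper: Spearman's rho per symptom, dropping incomplete pairs
spearman_by_symptom <- function(df) {
  sapply(split(df, df$symptom), function(d) {
    cor(d$survey_1, d$survey_2, method = "spearman", use = "complete.obs")
  })
}

rho_full <- spearman_by_symptom(data)           # what cor.test() sees
rho_unique <- spearman_by_symptom(data_unique)  # what the plot layer sees

print(round(rho_full, 2))
print(round(rho_unique, 2))
```

Rounded to two decimals, the full data gives 0.50 for headache and 0.82 for aura, matching cor.test, while the de-duplicated rows give 0.55 and 0.83, matching the plot.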
You can get the same groupings by swapping the data that is used at the top level:
ggplot(data = data, aes(x = survey_1, y = survey_2)) +
geom_point(data = data_n, aes(x = survey_1, y = survey_2, size = n)) +
sm_statCorr(corr_method = "spearman") +
facet_wrap(. ~ symptom) # this now uses `symptom` from `data`