I have a dataset and stacked bar chart similar to what’s shown below. The chart is for prevalence of a specific disease (CVD=0 for no, 1 for yes) over time, by treatment type (trt=1 or 2).
I am wondering if someone could please help me with:
a) change over time was statistically significant b) are there statistically significant differences between trt 1 and 2?
Any help/guidance would be appreciated.
Thanks!
Here’s a mock dataset similar to what I am working on and some code I’ve used to generate the stacked plot.
ID = c(001, 001, 001, 001, 001, 002, 002, 002, 003, 003, 004, 004, 004, 004, 005, 005, 006, 007, 007, 007, 008, 008, 008, 008, 008, 009, 009, 009, 009, 009)
Visit = c(00, 01, 02, 03, 04, 00, 01, 02, 01, 02, 00, 01, 02, 03, 00, 02, 00, 01, 02, 04, 00, 01, 02, 03, 04, 00, 01, 02, 03, 04)
CVD = c(0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0)
TRT= c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
BIO=c(12.00, 11.9, 15.24, 13.10, 30.01, 45.90, 20.09, 23.45, 14.78, 18.05,
24.23, 12.34, 80.01, 13.98, 12.50, 36.95, 29.00, 39.87, 19.03, 11.48,
14.14, 28.06, 12.22, 72.08, 15.00, 11.33, 58.00, 17.71, 52.08, 15.25)
df<-data.frame(ID,Visit, CVD, TRT, BIO)
percentData <- df %>% group_by(Visit, TRT) %>% count(CVD) %>%
mutate(ratio=scales::percent(n/sum(n))) %>%
ungroup()
ggplot(df,aes(x=Visit,fill=factor(CVD)))+
geom_bar(position="fill")+ facet_wrap(~TRT)+
geom_text(data=percentData, aes(y=n,label=ratio),
position=position_fill(vjust=0.5))+
ylab('Percentage')
This is quite a lot to ask in a single question. The way I would tackle this is to first of all generate a data frame with all the numbers you will need for plotting and tabling results:
library(tidyverse)
df2 <- df %>%
group_by(Visit, TRT) %>%
summarize(no_CVD = table(factor(CVD, 0:1))[1],
CVD = table(factor(CVD, 0:1))[2],
n = no_CVD + CVD,
prop = CVD / n,
lower = suppressWarnings(prop.test(CVD, n)$conf.int[1]),
upper = suppressWarnings(prop.test(CVD, n)$conf.int[2])) %>%
mutate(Percent = scales::percent(prop, 0.1),
CI = paste0("(", scales::percent(lower, 0.1), " - ",
scales::percent(upper, 0.1), ")"),
TRT = paste("Treatment", TRT))
You can see this gives us the point estimates and confidence intervals as both raw proportions and as text labels, which should be adequate to fulfill question 3:
df2
#> # A tibble: 10 x 10
#> # Groups: Visit [5]
#> Visit TRT no_CVD CVD n prop lower upper Percent CI
#> <dbl> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 0 Treatment 1 3 1 4 0.25 0.0132 0.781 25.0% (1.3% - 78.1~
#> 2 0 Treatment 2 0 3 3 1 0.310 1 100.0% (31.0% - 100~
#> 3 1 Treatment 1 2 2 4 0.5 0.150 0.850 50.0% (15.0% - 85.~
#> 4 1 Treatment 2 2 1 3 0.333 0.0177 0.875 33.3% (1.8% - 87.5~
#> 5 2 Treatment 1 1 3 4 0.75 0.219 0.987 75.0% (21.9% - 98.~
#> 6 2 Treatment 2 3 1 4 0.25 0.0132 0.781 25.0% (1.3% - 78.1~
#> 7 3 Treatment 1 0 2 2 1 0.198 1 100.0% (19.8% - 100~
#> 8 3 Treatment 2 1 1 2 0.5 0.0945 0.905 50.0% (9.5% - 90.5~
#> 9 4 Treatment 1 1 2 3 0.667 0.125 0.982 66.7% (12.5% - 98.~
#> 10 4 Treatment 2 1 0 1 0 0 0.945 0.0% (0.0% - 94.5~
To put the N under each visit, we can simply paste
the visit number with a string containing the counts from this table (this handles question 1).
Question 2 is a little odd. The sample prevalence of CVD was already present on your plot (that's what the blue bars showed). Adding error bars to a stacked bar graph is going to look messy, and the bars don't actually give any information that a simple point does in this scenario. I have therefore changed the plot to show points with error bars.
ggplot(df2, aes(paste0(Visit, "\n(n = ", n, ")"), prop)) +
geom_errorbar(aes(ymin = lower, ymax = upper, color = factor(TRT)),
size = 1, width = 0.25) +
geom_point(size = 2) +
geom_text(aes(label = paste(Percent, CI, sep = "\n"), y = 1.1), size = 3) +
facet_grid(.~TRT, scales = "free_x") +
scale_y_continuous(labels = scales::percent, name = "CVD Prevalance") +
xlab("Visit") +
theme_light(base_size = 16) +
theme(legend.position = "none") +
scale_color_brewer(palette = "Set1")
Your last two points are statistical rather than programming questions, so are off-topic here. It is likely that you need a binomial generalized linear mixed model to handle the repeated measures, but you should post that question on CrossValidated, our sister site for stats questions if you do not have access to a statistician.
Created on 2022-06-21 by the reprex package (v2.0.1)