I am an epidemiologist and I am quite new to R. I have a simple vaccination dataset in long format, which looks like this:
data <- data.frame(
  id = c(1,1,1,1,2,2,2,3,3,3,3),
  date = c("01/12/2020","02/12/2020","03/12/2020","04/12/2020",
           "01/31/2020","03/12/2020","04/05/2020","02/12/2020","04/12/2020","05/12/2020","01/12/2020"),
  vac_date = c("","02/02/2020","","04/02/2020","","","04/01/2020","","04/01/2020","05/01/2020",""),
  dose = c('',1,'',2,'','',1,'',1,2,''))
id: patient's identification
date: survey date
vac_date: vaccination date
dose: indicating the vaccination dose
I am really having trouble creating the frequency line plot I have in mind. I tried
ggplot(data, aes(x = date, y = vac_date)) + geom_line()
The dates and counts of vaccination are confusing. I would like to compute 2 plots:
Might someone please provide some help on getting the above plots? Thanks.
You have a few different problems here.
Missing values are stored as empty strings rather than NA. The date column appears to be irrelevant here, assuming that the vac_date column is accurate. What we are interested in is the proportion of all participants who had actually been vaccinated by a given date, so the survey date itself is not needed.
To fix all of this, I would start by defining a start date and an end date over which we want to display our plot. We should also count the number of participants in our data set.
participants <- length(unique(data$id))
start_date <- as.Date('2020-01-01')
end_date <- as.Date('2020-03-01')
Now let's tidy up the data, removing the useless date column and the useless rows with no vaccination information in them. We can then convert dose and vac_date to the correct format, and calculate what proportion of participants were vaccinated by each date in the dataset:
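library(tidyverse)
plot_df <- data %>%
  select(-date) %>%                                # drop the survey date
  filter(nzchar(vac_date)) %>%                     # keep rows with a vaccination date
  mutate(vac_date = lubridate::dmy(vac_date)) %>%  # parse as day/month/year
  mutate(dose = as.numeric(dose)) %>%
  group_by(dose) %>%
  arrange(dose, vac_date) %>%
  # pad each dose with the start and end dates so the steps span the whole plot
  reframe(vac_date = c(start_date, vac_date, end_date),
          vacc = c(0, row_number(), n()) / participants)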
Now our data looks like this, with a column indicating which dose of the vaccine was received, a column for the date of vaccination, and a column for the proportion of participants who had received that dose by that date.
plot_df
#> # A tibble: 9 x 3
#>    dose vac_date    vacc
#>   <dbl> <date>     <dbl>
#> 1     1 2020-01-01 0
#> 2     1 2020-01-04 0.333
#> 3     1 2020-01-04 0.667
#> 4     1 2020-02-02 1
#> 5     1 2020-03-01 1
#> 6     2 2020-01-01 0
#> 7     2 2020-01-05 0.333
#> 8     2 2020-02-04 0.667
#> 9     2 2020-03-01 0.667
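As a cross-check on the vacc column, the same cumulative proportion for the first dose can be computed in base R. This is only a sketch: it assumes the data frame from the question and the same day/month/year parsing used in the pipeline, and dose1_dates and cum_prop are names introduced here for illustration.

```r
# Rebuild the example data from the question (vaccination columns only)
data <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  vac_date = c("", "02/02/2020", "", "04/02/2020", "", "", "04/01/2020",
               "", "04/01/2020", "05/01/2020", ""),
  dose = c("", 1, "", 2, "", "", 1, "", 1, 2, "")
)
participants <- length(unique(data$id))

# Keep first-dose rows and parse the dates as day/month/year (an assumption
# carried over from the pipeline), then sort them chronologically
dose1_dates <- sort(as.Date(data$vac_date[data$dose == "1"], format = "%d/%m/%Y"))

# Cumulative proportion of participants vaccinated after each successive date
cum_prop <- seq_along(dose1_dates) / participants
data.frame(vac_date = dose1_dates, vacc = cum_prop)
```

The three values this produces should line up with rows 2 to 4 of plot_df; the rows for the start and end dates are padding added for plotting.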
Our plotting code could then look something like this:
ggplot(plot_df, aes(vac_date, vacc)) +
  geom_step(aes(color = factor(dose))) +
  scale_y_continuous('Percent vaccinated', labels = scales::percent) +
  scale_x_date('Date', date_labels = '%d %b %Y', date_breaks = 'month') +
  scale_color_manual('Dose', values = c('navy', 'orangered')) +
  theme_minimal(base_size = 16)
This allows us to see what percentage of participants had their first and second vaccinations at any moment in time.
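If you want the number behind the plot for one particular day, a small helper along these lines might do. It is a sketch rather than part of the pipeline above: prop_vaccinated is a hypothetical function, and the vac data frame simply restates the parsed vaccination dates (day/month/year, as assumed earlier).

```r
# Vaccination dates per dose, already parsed (day/month/year assumption)
vac <- data.frame(
  dose = c(1, 1, 1, 2, 2),
  vac_date = as.Date(c("2020-02-02", "2020-01-04", "2020-01-04",
                       "2020-02-04", "2020-01-05"))
)
participants <- 3

# Hypothetical helper: proportion of participants who had received
# dose `dose_n` on or before `on_date`
prop_vaccinated <- function(vac, dose_n, on_date, participants) {
  sum(vac$dose == dose_n & vac$vac_date <= as.Date(on_date)) / participants
}

prop_vaccinated(vac, 1, "2020-01-15", participants)  # two of three participants
prop_vaccinated(vac, 2, "2020-01-15", participants)  # one of three participants
```

This gives the same value the step line shows at that date, because geom_step holds each proportion constant until the next vaccination.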
Created on 2023-10-19 with reprex v2.0.2