frequency or proportion plot by date or week with ggplot2

I am an epidemiologist and quite new to R. I have a simple vaccination dataset in long format, which looks like this:

data <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  date = c("01/12/2020", "02/12/2020", "03/12/2020", "04/12/2020", "01/31/2020", "03/12/2020",
           "04/05/2020", "02/12/2020", "04/12/2020", "05/12/2020", "01/12/2020"),
  vac_date = c("", "02/02/2020", "", "04/02/2020", "", "", "04/01/2020", "", "04/01/2020", "05/01/2020", ""),
  dose = c('', 1, '', 2, '', '', 1, '', 1, 2, '')
)

id: patient's identification
date: survey date
vac_date: vaccination date
dose: indicating the vaccination dose

I am having trouble creating the frequency line plot I have in mind. I tried

ggplot(data, aes(x = date, y = vac_date)) + geom_line()

I find the dates and vaccination counts confusing. I would like to produce two plots:

frequency or proportion plot by date or week, regardless of dose
frequency or proportion plot by date or week, by dose (overlaid), as shown in the following picture

https://imgur.com/KEoV4cR

Could someone please help me produce the plots above? Thanks.

Solution

You have a few different problems here.

The first is that your data is in the wrong format. You cannot use empty strings in a numeric column in R to indicate missing data. That just turns the whole column into a character vector, meaning you cannot perform any maths operations on it. Instead, missing values in numeric or date columns should be labelled as NA.
Secondly, your dates are currently just character strings and need to be converted to actual dates.
Thirdly, the date column appears to be irrelevant here, assuming that the vac_date column is accurate. What we are interested in is the proportion of all participants who had been vaccinated by a given date. The survey date itself is not needed.
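
The first two points can be checked on a small vector. This is a minimal base R sketch; the names dose_raw and vac_raw are made up for illustration, and the day/month/year format is an assumption about how the dates were recorded:

```r
# Mixing '' with numbers makes R coerce everything to character
dose_raw <- c('', 1, '', 2)
class(dose_raw)   # "character"

# Recode the empty strings as NA, then convert to numeric
dose_raw[dose_raw == ''] <- NA
dose <- as.numeric(dose_raw)
sum(dose, na.rm = TRUE)   # arithmetic now works

# Same idea for dates: '' becomes NA, the rest become Date objects
vac_raw <- c('', '02/02/2020', '', '04/02/2020')
vac_raw[vac_raw == ''] <- NA
vac_date <- as.Date(vac_raw, format = '%d/%m/%Y')  # assumed day/month/year
class(vac_date)   # "Date"
```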

To fix all of this, I would start by defining the start and end dates over which we want to display our plot. We should also count the number of participants in our data set.

participants <- length(unique(data$id))
start_date <- as.Date('2020-01-01')
end_date <- as.Date('2020-03-01')

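Now let's tidy up the data, removing the unneeded date column and the rows with no vaccination information. We can then convert dose and vac_date to the correct types and calculate what proportion of participants had been vaccinated by each date in the dataset. I have read the vaccination dates as day/month/year here; if they are actually month/day/year (as the survey date 01/31/2020 suggests), use lubridate::mdy() instead and move end_date later to cover them: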

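library(tidyverse)  # reframe() needs dplyr >= 1.1.0

plot_df <- data %>%
  select(-date) %>%                                # survey date is not needed
  filter(nzchar(vac_date)) %>%                     # keep rows with a vaccination date
  mutate(vac_date = lubridate::dmy(vac_date)) %>%  # character -> Date (day/month/year)
  mutate(dose = as.numeric(dose)) %>%
  group_by(dose) %>%
  arrange(dose, vac_date) %>%
  # pad each dose with the start and end dates so the step line spans the whole plot
  reframe(vac_date = c(start_date, vac_date, end_date),
          vacc = c(0, row_number(), n()) / participants)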

Now our data looks like this, with a column for the dose, a column for the date of vaccination, and a column for the proportion of participants who had received that dose by that date.

plot_df
#> # A tibble: 9 x 3
#>    dose vac_date    vacc
#>   <dbl> <date>     <dbl>
#> 1     1 2020-01-01 0    
#> 2     1 2020-01-04 0.333
#> 3     1 2020-01-04 0.667
#> 4     1 2020-02-02 1    
#> 5     1 2020-03-01 1    
#> 6     2 2020-01-01 0    
#> 7     2 2020-01-05 0.333
#> 8     2 2020-02-04 0.667
#> 9     2 2020-03-01 0.667
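
The vacc values come from the c(0, row_number(), n()) / participants step. As a hand check for dose 1, the same arithmetic can be reproduced in base R (the variable names here are illustrative only):

```r
participants <- 3

# Dose 1 vaccination dates from the table above, already sorted
dose1_dates <- as.Date(c('2020-01-04', '2020-01-04', '2020-02-02'))
n_dose1 <- length(dose1_dates)

# 0 at the start date, a running count at each vaccination,
# and the final count repeated at the end date
vacc <- c(0, seq_len(n_dose1), n_dose1) / participants
round(vacc, 3)
```

The two padding values are what make the step line start at zero on the start date and stay flat up to the end date.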

Our plotting code could then look something like this:

ggplot(plot_df, aes(vac_date, vacc)) +
  geom_step(aes(color = factor(dose))) +
  scale_y_continuous('Percent vaccinated', labels = scales::percent) +
  scale_x_date('Date', date_labels = '%d %b %Y', date_breaks = 'month') +
  scale_color_manual('Dose', values = c('navy', 'orangered')) +
  theme_minimal(base_size = 16)

This allows us to see what percentage of participants had their first and second vaccinations at any moment in time.
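
For the plot regardless of dose, and for a by-week view, something along the following lines might work. This is only a sketch under assumptions: vacc_long is a hand-typed stand-in for the cleaned data (dates read as day/month/year), "vaccinated regardless of dose" is taken to mean at least one dose, and weeks are taken to start on Monday. None of those choices come from the question, so adjust them as needed.

```r
library(dplyr)
library(ggplot2)

# Illustrative stand-in for the cleaned data: one row per recorded vaccination
vacc_long <- data.frame(
  id       = c(1, 1, 2, 3, 3),
  dose     = c(1, 2, 1, 1, 2),
  vac_date = as.Date(c('2020-02-02', '2020-02-04', '2020-01-04',
                       '2020-01-04', '2020-01-05'))
)
participants <- 3

# Regardless of dose: count each participant once, at their earliest vaccination
any_dose <- vacc_long %>%
  group_by(id) %>%
  summarise(first_vac = min(vac_date)) %>%
  arrange(first_vac) %>%
  mutate(prop = row_number() / participants)

p_any <- ggplot(any_dose, aes(first_vac, prop)) +
  geom_step() +
  scale_y_continuous('Vaccinated with at least one dose', labels = scales::percent)

# By week: number of vaccinations per dose in each week (weeks starting on Monday)
weekly <- vacc_long %>%
  mutate(week = lubridate::floor_date(vac_date, 'week', week_start = 1)) %>%
  count(week, dose)

p_week <- ggplot(weekly, aes(week, n, fill = factor(dose))) +
  geom_col(position = 'dodge') +
  labs(x = 'Week starting', y = 'Vaccinations', fill = 'Dose')
```

Printing p_any or p_week draws the corresponding plot.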

Created on 2023-10-19 with reprex v2.0.2