
frequency or proportion plot by date or week with ggplot2


I am an epidemiologist and quite new to R. I have a simple vaccination dataset in long format, which looks like this:

data<-data.frame(id=c(1,1,1,1,2,2,2,3,3,3,3),date=c("01/12/2020","02/12/2020","03/12/2020","04/12/2020",
"01/31/2020","03/12/2020","04/05/2020","02/12/2020","04/12/2020","05/12/2020","01/12/2020"),vac_date=c("","02/02/2020","","04/02/2020","","","04/01/2020","","04/01/2020","05/01/2020",""),dose=c('',1,'',2,'','',1,'',1,2,''))

id: patient's identification
date: survey date
vac_date: vaccination date
dose: indicating the vaccination dose

I am having trouble creating the frequency line plot I have in mind. I tried:

ggplot(data, aes(x = date, y = vac_date)) + geom_line()

I find the dates and vaccination counts confusing. I would like to produce two plots:

  1. frequency or proportion plot by date or week regardless of dose
  2. frequency or proportion plot by date or week by dose (overlay) as shown in the following pic

https://imgur.com/KEoV4cR

Might someone please provide some help on getting the above plots? Thanks.


Solution

  • You have a few different problems here.

    • The first is that your data is in the wrong format. You cannot use empty strings in a numeric column in R to indicate missing data. That just turns the whole column into a character vector, meaning you cannot perform any maths operations on it. Instead, missing values in numeric or date columns should be labelled as NA.
    • Secondly, your dates are currently just character strings and need to be converted to actual dates.
    • Thirdly, the date column appears to be irrelevant here, assuming that the vac_date column is accurate. What we are interested in is the proportion of all participants who had been vaccinated by a given date. The survey date itself is not needed.
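
    The first two points can be sketched with base R alone. The vectors below are toy copies of two columns from the question, and the names dose_raw, vac_raw, vac_dmy and vac_mdy are illustrative. Reading vac_date as day/month/year is an assumption; the survey dates look like month/day/year, so the second parse shows what changes if vac_date follows that convention instead.

```r
# Toy copies of the question's dose and vac_date columns
dose_raw <- c("", "1", "", "2")
vac_raw  <- c("", "02/02/2020", "", "04/02/2020")

# Empty strings force the whole column to character
class(dose_raw)
#> [1] "character"

# Mark missing entries as NA, then convert to the intended types
dose_raw[dose_raw == ""] <- NA
vac_raw[vac_raw == ""]   <- NA
dose <- as.numeric(dose_raw)

# Assumption: the strings are day/month/year
vac_dmy <- as.Date(vac_raw, format = "%d/%m/%Y")
# Alternative reading: month/day/year
vac_mdy <- as.Date(vac_raw, format = "%m/%d/%Y")

vac_dmy[4]
#> [1] "2020-02-04"
vac_mdy[4]
#> [1] "2020-04-02"
```

    The two readings differ by almost two months for the same string, so it is worth confirming the convention against the source data before plotting.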

    To fix all of this, I would start by defining the start date and end date of the period we want to plot. We should also count the number of participants in the data set.

    participants <- length(unique(data$id))
    start_date <- as.Date('2020-01-01')
    end_date <- as.Date('2020-03-01')
    

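    Now let's tidy up the data, removing the date column, which we don't need, and the rows with no vaccination information. We can then convert dose and vac_date to the correct types, and calculate what proportion of participants had been vaccinated by each date in the dataset. Note that lubridate::dmy() reads vac_date as day/month/year; if the column is actually month/day/year like the survey dates, use lubridate::mdy() instead and move end_date later so that it covers April and May.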

    library(tidyverse) 
    
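    plot_df <- data %>%
      select(-date) %>%                             # survey date is not needed
      filter(nzchar(vac_date)) %>%                  # keep rows with a vaccination date
      mutate(vac_date = lubridate::dmy(vac_date),   # parse as day/month/year
             dose = as.numeric(dose)) %>%
      group_by(dose) %>%
      arrange(dose, vac_date) %>%
      # pad each dose with the start and end dates so the step line spans the window
      reframe(vac_date = c(start_date, vac_date, end_date),
              vacc = c(0, row_number(), n()) / participants)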
    

    Now our data looks like this, with a column indicating the dose, a column for the date of vaccination, and a column for the cumulative proportion of participants who had received that dose by that date.

    plot_df
    #> # A tibble: 9 x 3
    #>    dose vac_date    vacc
    #>   <dbl> <date>     <dbl>
    #> 1     1 2020-01-01 0    
    #> 2     1 2020-01-04 0.333
    #> 3     1 2020-01-04 0.667
    #> 4     1 2020-02-02 1    
    #> 5     1 2020-03-01 1    
    #> 6     2 2020-01-01 0    
    #> 7     2 2020-01-05 0.333
    #> 8     2 2020-02-04 0.667
    #> 9     2 2020-03-01 0.667
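
    The dose 1 rows in the output above can be checked by hand. This is a minimal base R sketch of the same arithmetic, not the dplyr pipeline itself; dose1, step_dates and step_prop are illustrative names, and the dates are again read as day/month/year.

```r
# Dose 1 vaccination dates from the question, read as day/month/year
participants <- 3
start_date <- as.Date("2020-01-01")
end_date   <- as.Date("2020-03-01")
dose1 <- sort(as.Date(c("02/02/2020", "04/01/2020", "04/01/2020"),
                      format = "%d/%m/%Y"))

# Pad with the start and end dates so the step line spans the whole window
step_dates <- c(start_date, dose1, end_date)

# Running count 0, 1, 2, 3, then held at 3, divided by the cohort size
step_prop <- c(0, seq_along(dose1), length(dose1)) / participants

data.frame(vac_date = step_dates, vacc = round(step_prop, 3))
```

    The leading 0 and the repeated final count are what make the step line start at zero on the start date and stay flat until the end date.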
    

    Our plotting code could then look something like this:

    ggplot(plot_df, aes(vac_date, vacc)) +
      geom_step(aes(color = factor(dose))) +
      scale_y_continuous('Percent vaccinated', labels = scales::percent) +
      scale_x_date('Date', date_labels = '%d %b %Y', date_breaks = 'month') +
      scale_color_manual('Dose', values = c('navy', 'orangered')) +
      theme_minimal(base_size = 16) 
    

    This allows us to see what percentage of participants had received their first and second doses at any point in time.
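
    The question also asks about counts by week. One possible approach, sketched here in base R under the same day/month/year assumption, is to label each vaccination date with the Monday that starts its week and then tabulate by week and dose. The names vac and weekly are illustrative, and the denominator of 3 is the number of participants in the example data.

```r
# Vaccination dates and doses from the question, read as day/month/year
vac <- data.frame(
  vac_date = as.Date(c("02/02/2020", "04/02/2020", "04/01/2020",
                       "04/01/2020", "05/01/2020"), format = "%d/%m/%Y"),
  dose = c(1, 2, 1, 1, 2)
)

# cut() labels each date with the Monday that starts its week
vac$week <- as.Date(as.character(cut(vac$vac_date, breaks = "week")))

# Vaccinations per week and dose, as counts and as a proportion of participants
weekly <- as.data.frame(table(week = vac$week, dose = vac$dose),
                        stringsAsFactors = FALSE)
weekly$prop <- weekly$Freq / 3
weekly
```

    Here week and dose come back as character columns, so week would need converting with as.Date() before being mapped to a date axis in ggplot2. These are per-week proportions rather than the cumulative ones plotted above.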

    Created on 2023-10-19 with reprex v2.0.2