Search code examples
r

Cumulative Percent's Across Multiple Groups


Suppose there is a hospital clinic that has a list of how many patients visited the hospital every day. I have data over 10 years - but patients don't visit the clinic everyday. To give an example, the data looks like this (in R):

library(dplyr)

set.seed(123)

start_date <- as.Date("2010-01-01")
end_date <- as.Date("2019-12-31")
all_dates <- seq.Date(start_date, end_date, by="day")

num_visits <- sample(1:length(all_dates), size = 3000, replace = FALSE)
visit_dates <- all_dates[num_visits]

num_patients <- sample(1:100, size = length(visit_dates), replace = TRUE)

clinic_data <- data.frame(date = visit_dates, num_patients = num_patients)

hospital_data <- clinic_data %>% arrange(date)

       date num_patients
 2010-01-01           90
 2010-01-02           96
 2010-01-04           65
 2010-01-05           80
 2010-01-06           15
 2010-01-07           87

I want to try and answer the following question: On average - for any given month, what percent of all patients that month will have visited the clinic by day $y$? For example, suppose in some month I know that 900 people visited the hospital - I want to know that by the 19th, what percent (cumulatively) of these 900 had likely visited the hospital BY THEN based on previous trends?

I tried to do this by manually identifying different logical steps:

library(ggplot2)

hospital_data$year <- as.numeric(format(as.Date(hospital_data$date), "%Y"))
hospital_data$month <- as.numeric(format(as.Date(hospital_data$date), "%m"))
hospital_data$day <- as.numeric(format(as.Date(hospital_data$date), "%d"))

hospital_data <- hospital_data[order(hospital_data$date), ]

yearly_totals <- aggregate(num_patients ~ year, data = hospital_data, FUN = sum)
names(yearly_totals)[2] <- "yearly_total"

hospital_data <- merge(hospital_data, yearly_totals, by = "year")

results <- by(hospital_data, hospital_data$year, function(df) {
    df$cumulative_patients <- cumsum(df$num_patients)
    df$cumulative_percentage <- df$cumulative_patients / df$yearly_total * 100
    return(df)
})
results <- do.call(rbind, results)

avg_results <- aggregate(cumulative_percentage ~ day, data = results, FUN = mean, na.rm = TRUE)

avg_results <- avg_results[order(avg_results$day), ]

ggplot(avg_results, aes(x = day, y = cumulative_percentage)) +
    geom_line() +
    geom_point() +
    scale_x_continuous(breaks = seq(1, 31, by = 5)) +
    scale_y_continuous(limits = c(0, 100)) +
    labs(title = "Average Cumulative Percentage of Yearly Patients by Day",
         x = "Day of Month",
         y = "Average Cumulative Percentage of Patients") +
    theme_minimal() +
    theme(panel.grid.minor = element_blank())

But my graph is not displaying this cumulative percent:

enter image description here

Does someone have ideas where I am messing this up?

EDIT:

library(tidyverse)

result <- hospital_data %>%
  mutate(month = floor_date(date, "month"),
         day = day(date)) %>%
  group_by(month) %>%
  arrange(month, day) %>%
  mutate(month_total = sum(num_patients),
         cuml = cumsum(num_patients),
         cuml_pct = cuml / month_total) %>%
  ungroup() %>%
  group_by(day) %>%
  summarize(avg_cuml_pct = mean(cuml_pct, na.rm = TRUE)) %>%
  arrange(day)

result <- result %>%
  mutate(avg_cuml_pct = cummax(avg_cuml_pct))

ggplot(result, aes(day, avg_cuml_pct)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
  scale_x_continuous(breaks = seq(0, 31, by = 5)) +
  labs(x = "Day of Month", 
       y = "Average Cumulative Percentage of Monthly Patients",
       title = "Average Cumulative Patient Percentage by Day of Month") +
  theme_minimal()

Solution

  • Perhaps something like this? Each of the light gray lines is the cumulative percentage of patients through each month, by day. The dark line is an unweighted average of those averages. You might want a weighted average, but it's not much of a difference here given the many months with similar scale.

    library(tidyverse)
    hospital_data |>
      mutate(month = floor_date(date, "month"),
             day = day(date)) |>
      mutate(cuml = cumsum(num_patients),
             cuml_pct = cuml / sum(num_patients), .by = month) |>
      ggplot(aes(day, cuml_pct)) +
      geom_line(aes(group = month), alpha = 0.1) +
      geom_line(data = ~summarize(., cuml_pct = mean(cuml_pct), .by = day))
    

    enter image description here

    Or we could do the same on a weighted basis, but note that since some months have 31 days, this will suggest we need until the 31st of any month (even those with 28/29/30 days) to get to 100%.

    hospital_data |>
      mutate(day = day(date)) |>
      count(day, wt = num_patients) |> # = summarize(n = sum(num_patients), .by = day)
      arrange(day) |>
      mutate(cuml_pct = cumsum(n)/sum(n)) |>
      ggplot(aes(day, cuml_pct)) +
      geom_line()
    

    enter image description here