Search code examples
rdataframesummarize

Summarizing a collection of data frames - improving upon a clumsy solution


I have a collection of data frames, df_i, representing the ith visit of a set of patients to a hospital. I'd like to summarize each of the data frames to determine the number of men, women and total patients at the ith visit. While I can solve this, my solution is clumsy. Is there a simpler way to get the final dataframe that I want? Example follows:

df_1 <- data.frame(
  ID     = c(rep("A",4), rep("B",3), rep("C",2), "D"),
  Dates  = seq.Date(from = as.Date("2020-01-01"), to = as.Date("2020-01-10"), by = "day"),
  Sex    = c(rep("Male",4), rep("Male",3), rep("Female",2), "Female"),
  Weight = seq(100, 190, 10),
  Visit  = rep(1, 10)
)

df_2 <- data.frame(
  ID     = c(rep("A",4), rep("B",3), rep("C",2)),
  Dates  = seq.Date(from = as.Date("2020-02-01"), to = as.Date("2020-02-9"), by = "day"),
  Sex    = c(rep("Male",4), rep("Male",3), rep("Female",2)),
  Weight = seq(100, 180, 10),
  Visit  = rep(2, 5)
)

df_3 <- data.frame(
  ID     = c(rep("A",4), rep("B",3)),
  Dates  = seq.Date(from = as.Date("2020-03-01"), to = as.Date("2020-03-07"), by = "day"),
  Sex    = rep("Male",7),
  Weight = seq(140, 200, 10),
  Visit  = rep(3, 7)
)

I'm looking to generate the following result:

> df_sum
  Visit Patients Men Women
1     1        4   2     2
2     2        3   2     1
3     3        2   2     0

I can do this in a very clumsy way: First create a temporary data frame that summarizes the information in df_1

df_tmp <- df_1 %>%
            group_by(ID) %>%
            filter(Dates == min(Dates)) %>%
            summarize(n = n(), Men = sum(Sex == "Male"), Women = sum(Sex == "Female"))
> df_tmp
# A tibble: 4 x 4
  ID        n   Men Women
  <chr> <int> <int> <int>
1 A         1     1     0
2 B         1     1     0
3 C         1     0     1
4 D         1     0     1

Next, sum each of the columns in df_tmp to create the first row for the summary column.

r1 <- c(sum(df_tmp$n), sum(df_tmp$Men), sum(df_tmp$Women))

Repeat for the second and third data frames. Finally rbind the rows together to create the summary data frame. While this works, it is extremely clumsy, and doesn't generalize to the case when I have a variable number of visits. Would someone kindly point me to a mmore elegant solution to my problem?

Many thanks in advance

Thomas Philips


Solution

  • Could also make into a tibble with bind_rows:

    library(tidyverse)
    bind_rows(df_1, df_2, df_3, .id = "day") %>%
      group_by(day, ID) %>%
      slice_min(Dates) %>%
      group_by(day) %>%
      summarize(n = n(), Men = sum(Sex == "Male"), Women = sum(Sex == "Female"))
    

    Result

    # A tibble: 3 x 4
      day       n   Men Women
    * <chr> <int> <int> <int>
    1 1         4     2     2
    2 2         3     2     1
    3 3         2     2     0