I have a collection of data frames, df_i
, representing the ith visit of a set of patients to a hospital. I'd like to summarize each of the data frames to determine the number of men, women and total patients at the ith visit. While I can solve this, my solution is clumsy. Is there a simpler way to get the final dataframe that I want? Example follows:
df_1 <- data.frame(
ID = c(rep("A",4), rep("B",3), rep("C",2), "D"),
Dates = seq.Date(from = as.Date("2020-01-01"), to = as.Date("2020-01-10"), by = "day"),
Sex = c(rep("Male",4), rep("Male",3), rep("Female",2), "Female"),
Weight = seq(100, 190, 10),
Visit = rep(1, 10)
)
df_2 <- data.frame(
ID = c(rep("A",4), rep("B",3), rep("C",2)),
Dates = seq.Date(from = as.Date("2020-02-01"), to = as.Date("2020-02-9"), by = "day"),
Sex = c(rep("Male",4), rep("Male",3), rep("Female",2)),
Weight = seq(100, 180, 10),
Visit = rep(2, 5)
)
df_3 <- data.frame(
ID = c(rep("A",4), rep("B",3)),
Dates = seq.Date(from = as.Date("2020-03-01"), to = as.Date("2020-03-07"), by = "day"),
Sex = rep("Male",7),
Weight = seq(140, 200, 10),
Visit = rep(3, 7)
)
I'm looking to generate the following result:
> df_sum
Visit Patients Men Women
1 1 4 2 2
2 2 3 2 1
3 3 2 2 0
I can do this in a very clumsy way: First create a temporary data frame that summarizes the information in df_1
df_tmp <- df_1 %>%
group_by(ID) %>%
filter(Dates == min(Dates)) %>%
summarize(n = n(), Men = sum(Sex == "Male"), Women = sum(Sex == "Female"))
> df_tmp
# A tibble: 4 x 4
ID n Men Women
<chr> <int> <int> <int>
1 A 1 1 0
2 B 1 1 0
3 C 1 0 1
4 D 1 0 1
Next, sum each of the columns in df_tmp
to create the first row for the summary column.
r1 <- c(sum(df_tmp$n), sum(df_tmp$Men), sum(df_tmp$Women))
Repeat for the second and third data frames. Finally rbind the rows together to create the summary data frame. While this works, it is extremely clumsy, and doesn't generalize to the case when I have a variable number of visits. Would someone kindly point me to a mmore elegant solution to my problem?
Many thanks in advance
Thomas Philips
Could also make into a tibble with bind_rows
:
library(tidyverse)
bind_rows(df_1, df_2, df_3, .id = "day") %>%
group_by(day, ID) %>%
slice_min(Dates) %>%
group_by(day) %>%
summarize(n = n(), Men = sum(Sex == "Male"), Women = sum(Sex == "Female"))
Result
# A tibble: 3 x 4
day n Men Women
* <chr> <int> <int> <int>
1 1 4 2 2
2 2 3 2 1
3 3 2 2 0