I am working in R
and I want to calculate the last column Annual_Cases
based on the available years for each surgeon. The variable Years
was reported per each 3-year time interval. Some surgeons were working in more than 1 hospital.
Here is my code that needs edits and my data
library(dplyr);library(tidyr)
df %>% drop_na(Cases) %>% group_by(Surgeon)%>% summarise(Annaul_Cases= sum(Cases))/n, na.rm=T)-> df # I need to calculate n column (number of years the surgeon were in service)
structure(list(Serial.ID = c(215, 522, 903, 1210, 1591, 1898,
2279, 2586, 2967, 3274, 3655, 3962, 4343, 4650, 5031, 5338, 330,
1018, 1706, 2394, 3082, 3770, 4458, 5146), Surgeon = c("A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "B", "B", "B", "B", "B", "B", "B", "B"), Center = c("Maimonides Medical Center",
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center",
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center",
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center",
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center",
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center",
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center",
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center",
"St. Vincents Hospital and Medical Center", "Mount Sinai Hospital",
"Mount Sinai Hospital", "Mount Sinai Hospital", "Mount Sinai Hospital",
"Mount Sinai Hospital", "Mount Sinai Hospital", "Mount Sinai Hospital",
"Mount Sinai Hospital"), Years = c("1996-1998", "1996-1998",
"1999-2001", "1999-2001", "2002-2004", "2002-2004", "2005-2007",
"2005-2007", "2008-2010", "2008-2010", "2011-2013", "2011-2013",
"2014-2016", "2014-2016", "2017-2019", "2017-2019", "1996-1998",
"1999-2001", "2002-2004", "2005-2007", "2008-2010", "2011-2013",
"2014-2016", "2017-2019"), Cases = c(377, 19, 223, NA, 27, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 110, 18, 13,
17, 4, 1), Deaths = c("10", "1", "6", "NA", "0", "NA", "NA",
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA",
"1", "0", "0", "0", "0", "0"), Annual_Cases = c("NA", "NA", "NA",
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA",
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA")), row.names = c(NA,
-24L), class = c("tbl_df", "tbl", "data.frame"))
Here is my desired output and my used manual formula for that.
Any advice will be greatly appreciated
First summarize to get sums for each year span, then summarize again to aggregate over all year spans.
library(dplyr)
library(tidyr)
df %>%
drop_na(Cases) %>%
summarize(Cases = sum(Cases), .by = c(Surgeon, Years)) %>%
summarize(Annual_Cases = sum(Cases) / (n() * 3), .by = Surgeon)
Result:
# A tibble: 2 × 2
Surgeon Annual_Cases
<chr> <dbl>
1 A 71.8
2 B 9.06
This solution assumes each year span is always 3 years. If not, you’ll need to use an approach like @JonSpring’s using separate()
.