I have a single data frame with information on many surgeons and their patients, for use in producing a Kaplan-Meier survival curve and fitting a Cox proportional hazards model. The data include a surgeon ID (sequential, starting at 1), patient age, patient sex, status (0 = censored, 1 = event), and the number of days between the index event (surgery) and either the end event (reoperation) or censoring (patient died, moved away, etc.).
I would like to produce one new data frame for each surgeon to support my analysis, create a new variable ("SurgeonGroup") based on the surgeon's ID - the SurgeonGroup is either "You" for records with that surgeon's ID, or "Other Surgeons" for all other values - and save the new data frame sequentially (DataProvider1, DataProvider2, etc.) so each surgeon can be compared to their peers in the survival curve and hazard ratio analysis. For example, the SurgeonGroup variable will be used to compare the surgeon with their peers using the coxph function as follows:
coxph(Surv(Days, Status) ~ PatientAge + PatientSex + SurgeonGroup, data = DataProvider1) %>%
  tbl_regression(exponentiate = TRUE)
The following code produces a smaller sample data frame with only 5 surgeons, creates a simple function, and creates 5 different data frames for 5 different providers by calling that function 5 times. However, since my original data frame has many more surgeons, writing out the data frame assignment/function call statement for each one is clunky and has a risk of copy/paste errors.
Is there a simple way to repeat this "DataProviderX <- MyFunction(X)" pattern for any similar dataset, producing as many new data frames as there are unique surgeons? I have searched for loop and apply-function approaches that could be used in this case, but can't seem to make any of them work (iteration is not my strength in R). Any advice would be much appreciated!
Here is my replicable example:
# Load dplyr Package
library(dplyr)
# Create Sample Data Frame
Surgeon <- c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5)
PatientAge <- c(69,84,94,67,92,76,74,92,76,89,96,99,94,95,84,85,99,93,89,84,74,86,77,88,81,82,89,88,88,81,83,95,81,72,80,92,83,83,96,82,98,79,84,88,91,82,89,88,78,88)
PatientSex <- c("M","F","F","F","F","F","M","M","F","F","M","M","F","F","F","F","F","M","F","F","F","M","F","M","M","F","F","F","M","M","F","M","F","M","F","M","F","M","M","M","F","M","F","F","M","F","M","F","M","F")
Status <- c(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0)
Days <- c(254,450,488,798,395,667,1836,220,3401,292,52,663,656,52,3797,1097,51,234,367,1641,1402,8,546,913,1849,2171,1474,312,2139,118,572,8,1175,2634,24,36,93,2627,312,1582,220,276,1329,135,116,933,2038,76,1018,1224)
Data <- data.frame(Surgeon, PatientAge, PatientSex, Status, Days)
# Create Function
MyFunction <- function(FunctionID) {
  FunctionData <- Data %>%
    mutate(SurgeonGroup = case_when(Surgeon == FunctionID ~ "You",
                                    TRUE ~ "Other Surgeons"))
  return(FunctionData)
}
DataProvider1 <- MyFunction(1)
DataProvider2 <- MyFunction(2)
DataProvider3 <- MyFunction(3)
DataProvider4 <- MyFunction(4)
DataProvider5 <- MyFunction(5)
I am not familiar with coxph(), so here is a more general approach that also includes the modeling step:
unique_ids <- unique(Data$Surgeon)
results <- lapply(unique_ids, function(id) {
  # Flag the current surgeon's records. This modifies a local copy of Data,
  # so the global data frame is left unchanged.
  Data$SurgeonGroup <- ifelse(Data$Surgeon == id, "You", "Other Surgeons")
  # Run your model and save the output. model() and the formula are
  # placeholders: substitute your own modeling function and variables.
  result <- model(outcome ~ predictor, data = Data)
  # Reshape the result into a data frame. Many ways to do that, for example
  # function glance() from package "broom" (https://broom.tidymodels.org/).
  broom::glance(result)
})
# Bind all results into a single data frame.
dplyr::bind_rows(results)
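For the Cox model in the question, the same pattern might look like the sketch below. It is only an illustration: fit_per_surgeon() is a hypothetical helper name, the data are simulated rather than taken from the question (the 50-row sample has only two events, which is too few for a stable fit), and hazard ratios are pulled from summary() instead of broom so that only the survival package is needed.

```r
library(survival)

# Hypothetical helper: fits one Cox model per surgeon and returns the
# hazard ratio for "You" vs "Other Surgeons" as one row per surgeon.
fit_per_surgeon <- function(Data) {
  ids <- sort(unique(Data$Surgeon))
  rows <- lapply(ids, function(id) {
    d <- Data
    # "Other Surgeons" is the reference level, so the HR reads "You vs peers".
    d$SurgeonGroup <- factor(
      ifelse(d$Surgeon == id, "You", "Other Surgeons"),
      levels = c("Other Surgeons", "You")
    )
    fit <- coxph(Surv(Days, Status) ~ PatientAge + PatientSex + SurgeonGroup,
                 data = d)
    # conf.int holds exp(coef) and its 95% limits, one row per model term.
    ci <- summary(fit)$conf.int["SurgeonGroupYou", ]
    data.frame(Surgeon = id,
               HR      = ci[["exp(coef)"]],
               Lower95 = ci[["lower .95"]],
               Upper95 = ci[["upper .95"]])
  })
  do.call(rbind, rows)
}

# Simulated data (an assumption for illustration) with the same columns
# as the question's data frame.
set.seed(1)
n <- 500
Sim <- data.frame(
  Surgeon    = rep(1:5, each = n / 5),
  PatientAge = round(rnorm(n, mean = 80, sd = 8)),
  PatientSex = sample(c("F", "M"), n, replace = TRUE),
  Status     = rbinom(n, size = 1, prob = 0.4),
  Days       = ceiling(rexp(n, rate = 1 / 800))
)

hr_table <- fit_per_surgeon(Sim)
print(hr_table)
```

Setting the factor levels explicitly is a design choice: without it, R orders the levels alphabetically, which happens to give the same reference level here but would silently change if the labels were renamed.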
If you would like to get up to speed on functional programming (i.e. the kind that makes heavy use of functions such as lapply()), check out this chapter in Hadley Wickham's book Advanced R: https://adv-r.hadley.nz/fp.html
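If you do want the per-surgeon data frames themselves, one option is to keep them in a named list rather than as separate objects. The sketch below is a minimal illustration under some assumptions: the data frame is a cut-down stand-in for the question's Data, MyFunction() is rewritten in base R so the example has no package dependencies, and DataProviders is a name chosen here.

```r
# Cut-down stand-in for the question's Data (three surgeons, two rows each).
Data <- data.frame(
  Surgeon = rep(1:3, each = 2),
  Status  = c(1, 0, 0, 0, 0, 0),
  Days    = c(254, 450, 52, 663, 1402, 8)
)

# Base-R equivalent of the question's MyFunction().
MyFunction <- function(FunctionID) {
  FunctionData <- Data
  FunctionData$SurgeonGroup <- ifelse(FunctionData$Surgeon == FunctionID,
                                      "You", "Other Surgeons")
  FunctionData
}

# One data frame per unique surgeon, named DataProvider1, DataProvider2, ...
ids <- sort(unique(Data$Surgeon))
DataProviders <- setNames(lapply(ids, MyFunction),
                          paste0("DataProvider", ids))

names(DataProviders)
DataProviders[["DataProvider2"]]

# If separate objects are really needed, the list can be unpacked with:
# list2env(DataProviders, envir = globalenv())
```

Keeping the data frames in a list usually makes later steps easier, because the same lapply() call can then run the model over every element.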