Search code examples
rfor-looprecursiondplyrsapply

How to a create recursive variable per group in data frame?


I have a data set with multiple observations of a clinical outcome for each patient. The length between these time points is variable. I want to create a "roving baseline score" variable for the clinical outcome, that changes every time an increase or a decrease of the score is confirmed at the first time point that is at least 90 days ahead.

I have created an example data frame with observations for two patients (patient 1 has 6 observations, patient 2 only has 2 observations):

library(tidyverse)

df <- tibble(
patID  = c("1", "1", "1", "1", "1", "1", "2", "2", "3", "3", "3", "3", "3", "3"),     
time_point = c("1", "2", "3", "4", "5", "6", "1", "2", "1", "2", "3", "4", "5", "6"),
date = as.Date(c("2020-01-01", "2020-05-01", "2020-06-01", "2020-09-01", "2021-01-01", "2021-05-01", "2020-01-01", "2020-05-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-06-01", "2020-10-01", "2021-04-01")),
score = c("300", "100", "100", "100", "200", "200", "600", "400", "300", "200", "200", "100", "400", "300"))

The vector with the baseline score should result in the following:

baseline = c("300", "300", "300", "100", "100", "200", "600", "600", "300", "300", "300", "200", "200", "300") 
  • Explanation: the score of patient 1 decreases at time point 2 from 300 to 100. This decrease can be confirmed at time point 4, since this is the closest time point that is at least 90 days ahead. Thus, the new baseline at time point 4 is 100. At time point 6, an increase in the score (with respect to the latest baseline) is confirmed and we obtain a new baseline of 200. Patient 2 only has two time points and therefore a change in the baseline cannot be confirmed.

I have tried creating the variable using a for loop, but I don't know how to make it such that it is done for each patient separately.

EDIT: I tried to write the code as following using sapply, but I get the error "argument is of length 0"

df_baseline <- df %>%
group_by(patID) %>%
mutate(baseline = lag(score, default = first(score)),
     closest_index_90 = sapply(1:n(), function(i) {
       valid_indices <- which(date[i]+90 <= date[(i+1):n()])
       if (length(valid_indices)>0) {return(first(valid_indices) + i)} 
       else {return(NA)}
     }),
     baseline = sapply(1:n(), function(i) {
       if (score[i:closest_index_90[i]] > baseline[i - 1] | score[i:closest_index_90[i]] < baseline[i - 1]) {return(score[which.min(abs(score[i:closest_index_90[i]]-baseline[i-1]))])}
     else {return(baseline[i-1])}}))

I am relatively new to R so my skills with for loops and sapply functions are limited. I have looked through many questions asked on here but none seemed to answer my question specifically. Any help would be greatly appreciated!


Solution

  • Found a solution using purrr::accumulate2

    df %>%
    group_by(patID) %>%
    mutate(
    closest_index_90 = sapply(1:n(), function(i) {
       valid_indices <- which(date[i]+90 <= date[(i+1):n()])
       if (length(valid_indices)>0) {return(first(valid_indices) + i)} 
       else {return(NA)}
     }),
    baseline = accumulate2(
       time_point,
       replace_na(closest_index_90, max(i)),
       \(b, i, r) {
       if (b>max(score[i:r])) {max(score[i:r]}
       else if (b<min(score[i:r])) {min(score[i:r]}
       else b
       }, .init = score[[1]]
       )[-1])