Extracting specific rows from a long-format dataset by condition / adapting long-format data to survival analysis


Background:

I have a dataset that I am preparing for a survival analysis. It was originally a longitudinal dataset in long format. I have an ID variable identifying participants, a time variable (months), and a binary 0/1 event variable (whether or not somebody met a "monthly loss limit" when gambling).

Problem/goal:

I am trying to create the necessary variables for the survival analysis and then remove the excess/unnecessary rows. My event (meeting a loss limit) can technically occur multiple times for each participant across the study period, but I am only interested in the first occurrence for a participant. I have made a time duration variable and attempted to modify it with an if-else statement so that participants that meet a loss limit have that specific month as their endpoint.

The problem is that I can't seem to filter in a way that keeps only the rows I want. I have attempted some code with an if-else statement, but I am getting an error. For participants who have met one or more loss limits, I want to extract the row with their first loss limit met, because the modified time duration is also contained in this row. For participants who never reach a loss limit it doesn't matter; any row is fine because they all contain the necessary information.

How do I accomplish this?

Example data frame and code:

library(dplyr)
# Example variables and data frame in long form
# Includes id variable, time variable and example event variable
id <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3 )
time <- c(2, 3, 4, 7, 3, 5, 7, 1, 2, 3, 4, 5)
metLimit <- c(0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1)

dfLong <- data.frame(id = id, time = time, metLimit = metLimit)

# Making variables, time at start, finish and duration variable 
dfLong <- dfLong %>% 
  group_by(id) %>% 
  mutate(startTime = first(time),
         lastTime = last(time))
dfLong <- dfLong %>% 
  group_by(id) %>% 
  mutate(timeDuration = ifelse(metLimit == "1", c(time - startTime), 
                               lastTime - startTime))
# My failed attempt at solving the problem
dfLong <- dfLong %>% 
  group_by(id) %>% 
  ifelse(metLimit == "1", filter(first(metLimit)), filter(last(time)

Solution

  • You could sort the rows within each id group and keep the first one:

    dfLong %>% 
      group_by(id) %>% 
      # This step is critical: within each id group, sort by metLimit descending
      # first, so rows with metLimit == 1 (if the id has any) come first.
      # Then sort by time ascending, so the row with the lowest time among
      # those is at the top of the id group.
      arrange(desc(metLimit), time, .by_group = TRUE) %>%
      slice_head(n = 1) # keep the first row of each id group
    

    This results in:

    # A tibble: 3 x 6
    # Groups:   id [3]
         id  time metLimit startTime lastTime timeDuration
      <dbl> <dbl>    <dbl>     <dbl>    <dbl>        <dbl>
    1     1     2        0         2        7            5
    2     2     5        1         3        7            2
    3     3     2        1         1        5            1
    

    Since you do not care which row is kept for participants who never reached their limit, this should be sufficient.
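
If the intermediate columns are not needed, the same one-row-per-participant table can also be built directly with `summarise()`. The following is only a sketch of that alternative, not part of the original answer. It rebuilds the example data from the question, and the names `dfSurv`, `event`, `endTime` and `duration` are made up for illustration.

```r
library(dplyr)

# Same example data as in the question
dfLong <- data.frame(
  id       = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
  time     = c(2, 3, 4, 7, 3, 5, 7, 1, 2, 3, 4, 5),
  metLimit = c(0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1)
)

# One row per participant.
# `event`, `endTime` and `duration` are illustrative names (assumptions).
dfSurv <- dfLong %>%
  arrange(id, time) %>%   # make sure first()/last() follow time order
  group_by(id) %>%
  summarise(
    startTime = first(time),
    # 1 if the participant met a loss limit at least once, else 0
    event     = as.integer(any(metLimit == 1)),
    # month of the first loss limit met, or the last observed month otherwise
    endTime   = if (any(metLimit == 1)) min(time[metLimit == 1]) else last(time),
    .groups   = "drop"
  ) %>%
  mutate(duration = endTime - startTime)

dfSurv
```

For the example data this gives durations of 5, 2 and 1 for ids 1, 2 and 3, matching the `timeDuration` column in the output above. It also adds an explicit `event` indicator alongside the duration, which survival functions typically expect.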