Search code examples
rcross-join

Efficient way to build data frame of (current state, next state) in R


I am working with a data set of patients' health state over time. I would like to compute the data frame of transitions from the current health state to the next health state.

Here is an example where the health state is measured only by AFP level and weight. The health state measurements might look like the following:

x <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                day = c(1, 2, 3, 1, 2, 3),
                event = c('status', 'status', 'death', 'status', 'status', 'status'),
                afp = c(10, 50, NA, 20, 30, 40),
                weight = c(100, 105, NA, 200, 200, 200))

The desired output looks like the following:

y <- data.frame(id = c(1, 1, 2, 2),
                current_afp = c(10, 50, 20, 30),
                current_weight = c(100, 105, 200, 200),
                next_event = c('status', 'death', 'status', 'status'),
                next_afp = c(50, NA, 30, 40),
                next_weight = c(105, NA, 200, 200))

One inefficient way to obtain the output is:

  • take the cross product of the measurements data frame with itself
  • keep only rows with matching ids, and day.x + 1 = day.y
  • rename the columns

Is there a more efficient way to obtain the output?

Note: The real measurements data frame can have more than 10 columns, so it is not very efficient from a lines of code perspective to explicitly write

current_afp = x$afp[1:(n-1)],
next_afp = x$afp[2:n]
...

and so on.


Solution

  • You could try:

    library(dplyr)
    
    x %>%
      mutate_each(funs(lead(.)), -id, -day) %>%
      full_join(x, ., by = c("id", "day")) %>%
      select(-event.x) %>%
      setNames(c(names(.)[1:2], 
                 paste0("current_", sub("\\..*","", names(.)[3:4])), 
                 paste0("next_", sub("\\..*","", names(.)[5:7])))) %>%
      group_by(id) %>%
      filter(day != last(day))
    

    Which gives:

    #  id day current_afp current_weight next_event next_afp next_weight
    #1  1   1          10            100     status       50         105
    #2  1   2          50            105      death       NA          NA
    #3  2   1          20            200     status       30         200
    #4  2   2          30            200     status       40         200