
How to create a table that measures transitions of elements over calendar periods?


I have a function that generates a transition table: it counts how the states of elements change over the time elapsed since each element first appeared ("Period_1" in the example data frame below). The code is shown immediately below:

library(data.table)

data <- 
  data.frame(
    ID = c(1,1,1,2,2,2,3,3,3),
    Period_1 = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
    Period_2 = c("2020-01","2020-02","2020-03","2020-04","2020-05","2020-06","2020-02","2020-03","2020-04"),
    Values = c(5, 10, 15, 0, 2, 4, 3, 6, 9),
    State = c("X0","X1","X2","X0","X2","X0", "X2","X1","X0")
  )

numTransit <- function(x, from=1, to=3){
  setDT(x)
  unique_state <- unique(x$State)
  all_states <- setDT(expand.grid(list(from_state = unique_state, to_state = unique_state)))
  dcast(x[, .(from_state = State[from], 
              to_state = State[to]), 
          by = ID]
        [,.N, c("from_state", "to_state")]
        [all_states,on = c("from_state", "to_state")], 
        to_state ~ from_state, value.var = "N"
  )
}

numTransit(data,1,3)
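
To make the positional indexing concrete, here is a minimal sketch (not part of the function itself) of the intermediate from/to pairs that `State[from]` and `State[to]` produce per ID before they are counted and reshaped. The names `dt` and `pairs` are illustrative only, and the sketch assumes the rows are already ordered by Period_1 within each ID.

```r
library(data.table)

# Only the columns needed for the illustration, copied from the example data
dt <- data.table(
  ID    = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  State = c("X0", "X1", "X2", "X0", "X2", "X0", "X2", "X1", "X0")
)

# Within each ID, pick the state at row position 1 and at row position 3
pairs <- dt[, .(from_state = State[1], to_state = State[3]), by = ID]
pairs
```

Each ID contributes exactly one (from_state, to_state) pair here, which is why the approach works for elapsed periods but not directly for calendar months, where an ID may have no row at the requested period.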

However, in the fuller code where this is deployed, I'm also trying to give the user the option to calculate the transitions over calendar periods instead ("Period_2" in the data frame). If the user wants to see transitions from month 2020-02 to month 2020-04, the output would look like the below. Only one element, ID = 3, existed in both 2020-02 and 2020-04, so only one element appears in the resulting transition table, and that element moved from state X2 to state X0 during that period:

> numTransit(data,"2020-02","2020-04")
   to_state X0 X1 X2
1:       X0 NA NA 1
2:       X1 NA NA NA
3:       X2 NA NA NA

Any ideas on how to do this? I'm new to data.table but am committed to it for its speed: this function is run against millions of rows of data and generates results in a fraction of a second. This post is a follow-on to the post "How to convert a for-loop to lapply function for parallel testing purposes?"


Solution

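  • Here is an alternative definition of your numTransit function, called num_transit below. The "from" and "to" roles are assigned by row order within each ID, so it assumes the rows for each ID are sorted by the reference variable and that from is the earlier period.

    (Updated: I moved convert_transits_to_matrix out of this function.)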

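    num_transit <- function(x, from, to, refvar = "Period_2", return_matrix = TRUE) {
      # Keep rows at the `from` or `to` period, only for IDs with more than one such row
      res <- x[get(refvar) %in% c(to, from), if (.N > 1) .SD, by = ID, .SDcols = c(refvar, "State")]
      # Number the remaining rows per ID in data order: 1 = first row, 2 = second row
      res <- res[, id := 1:.N, by = ID]
      # One row per ID holding both states, then count each (first, second) pair
      res <- dcast(res, ID ~ id, value.var = "State")[, .N, .(`1`, `2`)]
      setnames(res, c("from", "to", "ct"))
      if (return_matrix) return(convert_transits_to_matrix(res, unique(x$State)))
      res
    }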
    
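    convert_transits_to_matrix <- function(transits, states) {
      # Square matrix of NA with rows and columns named by state
      m <- matrix(NA, nrow = length(states), ncol = length(states), dimnames = list(states, states))
      # Index by (row name, column name) pairs: rows are `to`, columns are `from`
      m[as.matrix(transits[, .(to, from)])] <- transits$ct
      # rownames(m) is read from the matrix before `m` is reassigned
      m <- data.table(m)[, to_state := rownames(m)]
      setcolorder(m, "to_state")
      return(m[])
    }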
    

    Usage:

    setDT(data)
    num_transit(data, "2020-02", "2020-04")
    
       to_state    X0    X1    X2
         <char> <int> <int> <int>
    1:       X0    NA    NA     1
    2:       X1    NA    NA    NA
    3:       X2    NA    NA    NA
    
    num_transit(data, 1,3, refvar="Period_1")
    
       to_state    X0    X1    X2
         <char> <int> <int> <int>
    1:       X0     1    NA     1
    2:       X1    NA    NA    NA
    3:       X2     1    NA    NA
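
Because the function above assigns the from and to roles by row order, it may be worth comparing against a variant that matches each period explicitly. The following is a minimal sketch of that idea using an inner join on ID. The helper name `num_transit_join` and its signature are hypothetical (not from the original answer), it returns counts in long form rather than as a matrix, and it assumes each ID has at most one row per period.

```r
library(data.table)

# Same example data as in the question (Values column omitted)
data <- data.table(
  ID       = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Period_1 = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  Period_2 = c("2020-01", "2020-02", "2020-03", "2020-04", "2020-05",
               "2020-06", "2020-02", "2020-03", "2020-04"),
  State    = c("X0", "X1", "X2", "X0", "X2", "X0", "X2", "X1", "X0")
)

# Hypothetical helper: name and arguments are illustrative only
num_transit_join <- function(x, from, to, refvar = "Period_2") {
  x <- as.data.table(x)
  # State of each ID at the `from` period and at the `to` period
  at_from <- x[get(refvar) == from, .(ID, from_state = State)]
  at_to   <- x[get(refvar) == to,   .(ID, to_state = State)]
  # Inner join keeps only IDs observed at both periods
  pairs <- at_from[at_to, on = "ID", nomatch = NULL]
  # Count each (from_state, to_state) combination
  pairs[, .(ct = .N), by = .(from_state, to_state)]
}

res_cal <- num_transit_join(data, "2020-02", "2020-04")
res_rev <- num_transit_join(data, "2020-04", "2020-02")
res_rel <- num_transit_join(data, 1, 3, refvar = "Period_1")
```

In this sketch the result does not depend on how the rows are sorted, and swapping from and to swaps the direction of the reported transition. The long-form result has columns from_state, to_state and ct, so the columns would need renaming to from and to before passing it to convert_transits_to_matrix.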