data.table::frollapply does not work as expected when the input is a column of lists


I am having problems with frollapply from the data.table package. I have a dataset with a target_var column that consists of lists of equal length containing either numeric entries or NAs. I want to calculate the number of unique entries in these lists in a sliding window of length 2.

Surprisingly, the function slides through the sequence NA, 1, NA, 2, NA, 3, NA, 4, NA, 5, NA, 6 in steps of two, starting with NA, 1. To check this, uncomment the browser() and print(x) lines inside FUN.

# Packages
lapply(c("data.table","dplyr","tibble","dtplyr"), library, character.only = TRUE)

# Test data
dummy_data <- tribble(
  ~date,         ~target_var,
  "2022-10-20",  as.double(list(NA, NA , NA)),
  "2022-10-21",  as.double(list(NA, 1 , NA)),
  "2022-10-22",  as.double(list(2,  NA, 3)),
  "2022-10-23",  as.double(list(NA, 4, NA)),
  "2022-10-24",  as.double(list(5,  NA, 6))
) 

# Sliding window
dummy_data %>%
  lazy_dt() %>% 
  mutate(new_var = data.table::frollapply(
    x = target_var,
    n = 2,
    align = "right",
    FUN = function(x){
      # browser()
      # print(x)
      x %>%
        unlist(recursive = FALSE,
               use.names = FALSE) %>% 
        n_distinct(na.rm = TRUE)
      }
    )) %>%
  as_tibble()

# Expected results
expected_res <- tribble(
  ~date,         ~target_var,                  ~new_var,
  "2022-10-20",  as.double(list(NA, NA , NA)), NA,
  "2022-10-21",  as.double(list(NA, 1 , NA)),  1,
  "2022-10-22",  as.double(list(2,  NA, 3)),   3,
  "2022-10-23",  as.double(list(NA, 4, NA)),   3,
  "2022-10-24",  as.double(list(5,  NA, 6)),   3
) 

I expected the sliding window to slide through the rows of the dataset instead, starting with NA, 1, NA, 2, NA, 3, i.e. the second and third lists unpacked (the first list contains only NAs). I am not sure whether frollapply can combine the two lists within the sliding window, or what exactly happens.

Another issue is that the new variable is a list and not a single number, which is also unexpected.

The inner FUN works as expected when applied directly to the column, bypassing frollapply.

dummy_data$target_var %>% 
  unlist(recursive = FALSE,
         use.names = FALSE) %>% 
  n_distinct(na.rm = TRUE)

I have thought about concatenating the entries rather than creating a list, but the string processing steps turned out to be very inefficient. Does anyone have any idea why frollapply doesn't work as expected in this context, or what I am missing here?


Solution
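
Some background on the behaviour described in the question, sketched with a small made-up input (the name cols is only illustrative): when x is a list, frollapply rolls over each list element separately, as if it were a column, and returns one result vector per element. That would explain both the windows seen inside FUN and the list-valued result.

```r
library(data.table)

# Hypothetical input: a list of two numeric vectors (not the question's data)
cols <- list(c(1, 2, 3), c(4, 5, 6))

# The window of length 2 moves within each vector, never across list elements
res <- frollapply(cols, 2, sum)

res[[1]]  # NA 3 5
res[[2]]  # NA 9 11
```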

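  • With data.table: first use sapply to count the non-NA values in each row, then use frollsum to add up the counts over the window. This equals the number of unique entries as long as no value repeats within a window.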

    setDT(dummy_data)[, new_var := frollsum(sapply(target_var, function(x) sum(!is.na(x))), 2)][]
    
             date target_var new_var
           <char>     <list>   <num>
    1: 2022-10-20   NA,NA,NA      NA
    2: 2022-10-21   NA, 1,NA       1
    3: 2022-10-22    2,NA, 3       3
    4: 2022-10-23   NA, 4,NA       3
    5: 2022-10-24    5,NA, 6       3
    

    Note that the first row will always be NA because partial windows aren't implemented.
    If you need a partial rolling sum, a workaround is to use an adaptive window (adaptive = TRUE).
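
The adaptive-window workaround from the note could look roughly like the sketch below. It assumes the per-row non-NA counts of the example data have already been computed; the names counts and win are illustrative. Each position gets its own window length, so the first row uses a window of 1 instead of returning NA.

```r
library(data.table)

# Non-NA counts per row of the example data (rows 2022-10-20 to 2022-10-24)
counts <- c(0, 1, 2, 1, 2)

# Window length per position: 1 for the first row, then 2
win <- pmin(seq_along(counts), 2L)

# With adaptive = TRUE, n is a vector with one window length per element of x
partial_sum <- frollsum(counts, win, adaptive = TRUE)

partial_sum  # 0 1 3 3 3
```

Whether a 0 in the first row is preferable to NA depends on how incomplete windows should be interpreted downstream.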