Search code examples
rapplylapplysapply

Is nested list not suitable for manipulating single element in vector?


I have a following code:

for(i in 1:length(hh_temp)){
  hh_temp_save = hh_temp[[i]]
  for(j in 4:nrow(hh_temp_save)){
    hh_temp_save$max_min_sum_5days[j] = ifelse(sum(hh_temp_save$max_min_sum[(j-4):j])>2,1,0)
    hh_temp[[i]] = hh_temp_save
 }
}

where hh_temp is a list with length(hh_temp) = 12, each element in hh_temp is a dataframe.

I tried to convert the for-loop into nested apply but I found that

lapply(hh_temp,\(x){
  x = lapply(32:nrow(x),\(y){
             x$max_min_sum_5days[y] = ifelse(sum(x$max_min_sum[(y-4):y])>2,1,0)
             x
             })
  return(x)
               })

I can only return the manipulated vector instead of the whole dataset. Is there any way to return the whole dataset? Does it mean nested lapply is not suitable for manipulating single element in a vector?

I am sorry that I cannot provide the detail of dataset, some descriptive statistics can be provided:

> str(hh_temp)
List of 12
 $ : tibble [3,684 × 36] (S3: tbl_df/tbl/data.frame)
  ..$ max_min_sum                                      : num [1:3684] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ max_min_sum_5days                                : num [1:3684] NA NA NA NA NA NA NA NA NA NA ...
 $ : tibble [3,684 × 36] (S3: tbl_df/tbl/data.frame)
  ..$ max_min_sum                                      : num [1:3684] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ max_min_sum_5days                                : num [1:3684] NA NA NA NA NA NA NA NA NA NA ...

#repeated for 12 times
#max_min_sum is a binary variable

Sample data:

df = data.frame(a = as.factor(c(1,1,1,1,0,0,0,0,1,1,1,1,1,0,0,1,0,1)),
                b = rep(NA,18))

sample_list = list(df,df,df,df,df,df)

My expected outcome is to calculate the cumulative sum of the 5 consecutive elements in a and then if the consecutive sum is greater than 2, the corresponding element in b will be recoded as 1, otherwise 0.

a b
1 NA
1 NA
1 NA
1 NA
0 1
0 1
0 0

In the 5th element in a, since there are 4 1s and 1 0, therefore, the consecutive sum is greater than 2, the corresponding element in b will be recoded as 1.


Solution

  • My expected outcome is to calculate the cumulative sum of the 5 consecutive elements in a and then if the consecutive sum is greater than 2, the corresponding element in b will be recoded as 1, otherwise 0.

    If a is a factor variable, we need to run as.numeric(as.character(a)) beforehand to coerce a to numeric. We can use rollsum() from {zoo} for the rolling sum calulation.

    A solution using lapply() applied to slightly modified sample data.

    Code

    # sample_list = 
      lapply(sample_list, 
             \(x) { x$b = ifelse(
               zoo::rollsum(as.numeric(as.character(x$a)), 
                            k = 5, fill = NA, align = "right") > 2L, 1L, 0L)
             x})
    

    Or in a more compact way as suggested by @G. Grothendieck like

    lapply(sample_list, transform, 
           b = +(zoo::rollsumr(as.numeric(as.character(a)), k = 5L, fill = NA) > 2L))
    

    Result

    #> [[1]]
    #>    a  b
    #> 1  1 NA
    #> 2  1 NA
    #> 3  1 NA
    #> 4  1 NA
    #> 5  0  1
    #> 6  0  1
    #> 7  0  0
    #> 8  0  0
    #> 9  1  0
    #> 10 1  0
    #> 11 1  1
    #> 12 1  1
    #> 13 1  1
    #> 14 0  1
    #> 15 0  1
    #> 16 1  1
    #> 17 0  0
    #> 18 1  0
    #> 
    #> [[2]]
    #>    a  b
    #> 1  1 NA
    #> 2  2 NA
    #> 3  3 NA
    #> 4  4 NA
    #> 5  5  1
    #> 6  6  1
    #> 7  7  1
    #> 8  8  1
    #> 9  1  1
    #> 10 1  1
    #> 11 1  1
    #> 12 1  1
    #> 13 1  1
    #> 14 1  1
    #> 15 1  1
    #> 16 1  1
    #> 17 1  1
    #> 18 1  1
    #> 
    #> [[3]]
    #>    a  b
    #> 1  1 NA
    #> 2  1 NA
    #> 3  1 NA
    #> 4  1 NA
    #> 5  0  1
    #> 6  0  1
    #> 7  0  0
    #> 8  0  0
    #> 9  1  0
    #> 10 1  0
    #> 11 1  1
    #> 12 1  1
    #> 13 1  1
    #> 14 0  1
    #> 15 0  1
    #> 16 1  1
    #> 17 0  0
    #> 18 1  0
    #> 
    #> [[4]]
    #>    a  b
    #> 1  1 NA
    #> 2  1 NA
    #> 3  1 NA
    #> 4  1 NA
    #> 5  0  1
    #> 6  0  1
    #> 7  0  0
    #> 8  0  0
    #> 9  1  0
    #> 10 1  0
    #> 11 1  1
    #> 12 1  1
    #> 13 1  1
    #> 14 0  1
    #> 15 0  1
    #> 16 1  1
    #> 17 0  0
    #> 18 1  0
    #> 
    #> [[5]]
    #>    a  b
    #> 1  1 NA
    #> 2  1 NA
    #> 3  1 NA
    #> 4  1 NA
    #> 5  0  1
    #> 6  0  1
    #> 7  0  0
    #> 8  0  0
    #> 9  1  0
    #> 10 1  0
    #> 11 1  1
    #> 12 1  1
    #> 13 1  1
    #> 14 0  1
    #> 15 0  1
    #> 16 1  1
    #> 17 0  0
    #> 18 1  0
    #> 
    #> [[6]]
    #>    a  b
    #> 1  1 NA
    #> 2  1 NA
    #> 3  1 NA
    #> 4  1 NA
    #> 5  0  1
    #> 6  0  1
    #> 7  0  0
    #> 8  0  0
    #> 9  1  0
    #> 10 1  0
    #> 11 1  1
    #> 12 1  1
    #> 13 1  1
    #> 14 0  1
    #> 15 0  1
    #> 16 1  1
    #> 17 0  0
    #> 18 1  0
    

    Modified Data

    df = data.frame(a = as.factor(c(1,1,1,1,0,0,0,0,1,1,1,1,1,0,0,1,0,1)),
                    b = rep(NA,18))
    df2 = data.frame(a = as.factor(c(1:8, rep(1, 5), rep(1,5))), 
                     b = rep(NA,18))
    sample_list = list(df,df2,df,df,df,df)
    

    Created on 2023-12-08 with reprex v2.0.2


    Edit

    If your data is rather small and you do not want to rely on an external package like {zoo}, you might consider to write your own rolling sum function. Very basic example:

    basic_rollsum = \(x, k) {
      # stopifnot(is.numeric(x), is.integer(k))
      res = rep(NA, length(x))
      for (i in seq_along(x)) 
        # adjust indexing if needed 
        # look at na.rm-argument of sum
        if (i > k) res[i] = sum( x[(i-k+1L):(i)] )
      res
    }
    lapply(sample_list, 
           \(x) { x$b = ifelse(
             basic_rollsum(as.numeric(as.character(x$a)), k = 5L) > 2L, 1L, 0L)
           x})