r, list, data.table, string-length, rollapply

How to apply a function per row of a column in a data table with other rows as input?


For each row of the column "Response", I would like to check whether the 5 rows below it all have "Response" values (i.e. no NAs) and, if so, calculate the mean and standard deviation of those 5 rows. If any of those 5 rows has a missing "Response" value (NA), the output for that row should be NA, since I want the mean and standard deviation to be calculated from exactly n=5 values.

A sample of the Input.data looks like this:

 Response     
        NA               
         1                 
         2                 
         3                
        NA        
         1         
         1         
         2         
         3         
         4         
         5    

Here is the code I tried, which did not give the right solution:

Input.data$count.lag <- rollapplyr(Input.data[,c("Response")],list(-(4:0)),length, fill=NA)

Input.data$stdev <- ifelse(Input.data$count.lag <5, "NA", 
                            rollapplyr(Input.data[,c("Response")],list(-(4:0)),sd,fill=NA))
Input.data$mean <- ifelse(Input.data$count.lag <5, "NA", 
                           rollapplyr(Input.data[,c("Response")],list(-(4:0)),mean,fill=NA))

It gave the following, which is not what I am after:

 Response count.lag     stdev mean
       NA        NA        NA   NA
        1        NA        NA   NA
        2        NA        NA   NA
        3        NA        NA   NA
       NA         5        NA   NA
        1         5        NA   NA
        1         5        NA   NA
        2         5        NA   NA
        3         5        NA   NA
        4         5  1.303840  2.2
        5         5  1.581139  3.0

This is how the output should have been:

Response count.lag      stdev  mean
     NA         4        NA    NA
      1         4        NA    NA
      2         4        NA    NA
      3         4        NA    NA
     NA         5   1.303840   2.2
      1         5   1.581139   3.0
      1         5   1.581139   4.0
      2         5   1.581139   5.0
      3         5   1.581139   6.0
      4         5   1.581139   7.0
      5         5   1.581139   8.0

Can someone please suggest where the errors are and/or an alternative solution that works? Thank you!


Solution

  • A possible approach:

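    library(data.table)

    # for each row n, summarise the (up to) 5 rows below it
    Input[, c("count.lag","stdev","mean") := 
        transpose(lapply(1L:.N, function(n) {
            x <- Response[(n+1L):min(n+5L, .N)]   # window is shorter near the end of the column
            c(sum(!is.na(x)), sd(x), mean(x))     # non-NA count, sd, mean
        }))]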
    

    output:

        Response count.lag     stdev mean
     1:       NA         4        NA   NA
     2:        1         4        NA   NA
     3:        2         4        NA   NA
     4:        3         4        NA   NA
     5:       NA         5 1.3038405  2.2
     6:        1         5 1.5811388  3.0
     7:        1         5 1.5811388  4.0
     8:        2         5 1.5811388  5.0
     9:        3         5 1.5811388  6.0
    10:        4         5 1.5811388  7.0
    11:        5         5 1.5811388  8.0
    12:        6         4 1.2909944  8.5
    13:        7         3 1.0000000  9.0
    14:        8         2 0.7071068  9.5
    15:        9         1        NA 10.0
    16:       10         1        NA   NA
    

    data:

    Input <- fread("Response     
    NA               
    1                 
    2                 
    3                
    NA        
    1         
    1         
    2         
    3         
    4         
    5
    6
    7
    8
    9
    10")
    

    Edit: alternatively, as per MichaelChirico's suggestion, use shift. The values in the last rows differ from the first approach; which is appropriate depends on how the OP wants the end of the column to be handled.

    # requires data.table version >= 1.12.0 to use negative shifts (else use type='lead' with positive integers)
    Input[, c("count.lag", "stdev", "mean") := 
        .SD[, shift(Response, -1L:-5L)][, 
            .(apply(.SD, 1L, function(x) sum(!is.na(x))), 
                apply(.SD, 1L, sd), 
                apply(.SD, 1L, mean))]
    ]
    

    output:

        Response count.lag    stdev mean
     1:       NA         4       NA   NA
     2:        1         4       NA   NA
     3:        2         4       NA   NA
     4:        3         4       NA   NA
     5:       NA         5 1.303840  2.2
     6:        1         5 1.581139  3.0
     7:        1         5 1.581139  4.0
     8:        2         5 1.581139  5.0
     9:        3         5 1.581139  6.0
    10:        4         5 1.581139  7.0
    11:        5         5 1.581139  8.0
    12:        6         4       NA   NA
    13:        7         3       NA   NA
    14:        8         2       NA   NA
    15:        9         1       NA   NA
    16:       10         0       NA   NA
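
On where the original attempt goes wrong, as far as I can tell: `list(-(4:0))` gives the offsets for the current row and the four rows above it, not the five rows below; `length` counts NAs as well, so it returns 5 for every complete window; and `"NA"` is a string rather than a missing value. A minimal sketch of the same zoo approach with those points changed is below. It assumes the 16-row data from this answer, stored in a plain data.frame named `Input.data` as in the question, and it leaves the last five rows as NA because no full window of five rows exists below them.

```r
library(zoo)

# Same values as the "data" block above, as a plain data.frame
Input.data <- data.frame(
  Response = c(NA, 1, 2, 3, NA, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)

# Offsets +1..+5 select the five rows below the current row
ahead <- list(1:5)

# Count the non-missing values in each window (length() would count NAs too)
Input.data$count.lag <- rollapply(Input.data$Response, ahead,
                                  function(x) sum(!is.na(x)), fill = NA)

# sd() and mean() already return NA when the window contains an NA,
# so no ifelse() on the count is needed
Input.data$stdev <- rollapply(Input.data$Response, ahead, sd, fill = NA)
Input.data$mean  <- rollapply(Input.data$Response, ahead, mean, fill = NA)

print(Input.data)
```

Because `sd` and `mean` are called without `na.rm = TRUE`, any window with a missing value yields NA, which is the n=5 requirement from the question.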