Search code examples
rzoorolling-computationrollapply

zoo::rollapply window with over column values rather than rows


dat = structure(list(index = c(10505L, 10506L, 10511L, 10539L, 10542L, 
10579L, 10642L, 11008L, 11012L, 13011L, 13110L, 13116L, 13118L, 
13156L, 13259L, 13273L, 13313L, 13365L, 13380L, 13382L, 13445L, 
13453L, 13482L, 13483L, 13494L, 13543L, 13550L, 14462L, 14464L, 
14564L, 14599L, 14604L, 14674L, 14719L, 14728L, 14775L, 14860L, 
14874L, 14930L, 14933L, 14975L, 15031L, 15089L, 15117L, 15179L, 
15211L, 15241L, 15245L, 15255L, 15260L, 15418L, 15585L, 15627L, 
15644L, 15774L, 15776L, 15777L, 15790L, 15791L, 15833L, 15849L, 
15850L, 15886L, 16042L, 16127L, 16140L, 16141L, 16142L, 16365L, 
16485L, 16489L, 16515L, 16542L, 16738L, 16834L, 16949L, 17272L, 
17462L, 17569L, 17571L, 17641L, 17654L, 17694L, 17695L, 17709L, 
17748L, 17836L, 17922L, 18643L, 20113L, 20131L, 28914L, 29318L, 
30524L, 30741L, 30912L, 30923L, 30998L, 46650L, 46698L), V2 = c(3L, 
3L, 3L, 2L, 2L, 2L, 2L, 1L, 0L, 3L, 2L, 2L, 2L, 0L, 1L, 1L, 0L, 
0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 
0L, 0L, 1L, 2L, 2L, 2L, 2L, 1L, 0L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 
0L, 0L, 0L, 2L, 3L, 5L, 3L, 0L, 0L, 3L, 1L, 0L, 3L, 0L, 0L, 2L, 
1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 2L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 
1L, 1L, 1L)), row.names = c(NA, -100L), class = "data.frame")

Let's say I want to calculate a function across dat in a rolling window.

n_sites = function(x) {
    return(sum(x > 1))
}
zoo::rollapply(dat$V2, FUN=n_sites, width=100)

However, rather than using the number of rows as the window size, I'd like to use the actual numeric values in the index column. So I would like to have each window so that it includes roughly 100 units in the index column. Given there is roughly 100 units of index between the 1st and 7th rows, the first window would include these rows. Is this possible?

Happy for a solution using zoo or data.table or the like.


Solution

  • The width in rollapply can be a vector such that the i-th element is the width to use for the i-th row. There are a number of interpretations of the question. We could use the largest width covering no more than 100 index units, the smallest width at least 100 index units or the closest width to 100 index units. The question seems to ask for the third but the example width of 7 is not consistent with that and suggests that maybe the second interpretation is wanted. We give all three widths at the end. Pick whichever you want. Also the question says that the first window is 7 so that indicates that left alignment is wanted.

    library(zoo)
    
    w <- w2 # see calcs of w1, w2 and w3 at end.  Use whichever you want.
    transform(dat, roll = rollapplyr(V2, w, n_sites, fill = NA, align = "left"))
    

    If n_sites is just a stand-in for the actual function then we can use the above but if it is the actual function we can eliminate it and write it like this:

    transform(dat, roll = rollapplyr(V2 > 1, w, sum, fill = NA, align = "left"))
    

    Width

    Many variations of this are possible and we compute the three mentioned here.

    The code below use base R's findInterval. Recall that findInterval(x, vec), where x and vec are vectors and vec is non-decreasing, returns a vector the same length as x such that the ith component of the result is sum(x[i] >= vec) but does it more efficiently. That is, if x[i] is found in vec then it finds the last position in vec that equals x[i] and if x[i] is not in vec then it finds the last position in vec that is less than x[i]. Note that it returns positions, i.e. indexes, not values of vec. For example, findInterval(c(20, 30), c(10, 30, 30, 30, 40)) returns c(1, 4) since 1 is the position of the largest value in vec less than 20 and 4 is the position of the last value in vec equal to 30.

    n <- nrow(dat)
    index <- dat$index
    
    # i1 is row number of last index no more than current index + 100
    i1 <- findInterval(index + 100, index)
    w1 <- i1 - 1:n + 1
    
    # i2 is row number of first index at least equal to index + 100
    i2 <- pmin(findInterval(index + 100 - 1, index) + 1, n)
    w2 <- i2 - 1:n + 1
    w2[1]
    ## [1] 7
    
    # i is row number of index closest to current index + 100
    i <- ifelse(index + 100 - index[i1] <= index[i2] - (index + 100), i1, i2)
    w3 <- i - 1:n + 1