
R: Compute on previous elements of an ordered vector


Given an ordered vector vec <- c(1, 4, 6, 3, 2, 7), I want to compute, for each element i of vec, the weighted average of the previous elements, where an element's weight grows with its proximity to position i: the element at position j gets weight j, so the nearest previous element is weighted most heavily.

The function should proceed as follows.

  • For the first element 1, it should return NA (there is no previous element).
  • For the second element 4, it should return 1.
  • For the third element 6, it should return weighted.mean(x = c(1,4), w = c(1,2)).
  • For the fourth element 3, it should return weighted.mean(x = c(1,4,6), w = c(1,2,3)), and so on.

The resulting vector result should satisfy length(result) == length(vec) and equal c(NA, 1, 3, 4.5, 3.9, 3.266667).

UPDATE: to be clear, I want a solution that does not use an explicit loop. My current loop-based attempt:

result <- numeric()

for (i in 1:length(vec)) {
  if (i == 1) {
    ## no previous elements for the first position
    result <- c(result, NA)
  } else {
    previous_elements <- vec[1:(i-1)]
    ## grow the result by copy-and-append on every iteration
    result <-
      c(result,
        weighted.mean(x = previous_elements, w = 1:length(previous_elements)))
  }
}

Solution

  • Here's a naive implementation. Create a function that does what you say; the only 'clever' thing is to use seq_len(i - 1) instead of 1:(i - 1) to generate the indexes, since seq_len(0) returns a zero-length vector whereas 1:0 would give the unwanted c(1, 0)

    fun <- function(i, vec)
        weighted.mean(head(vec, i - 1), w = seq_len(i - 1))
    

    and then use it in sapply

    sapply(seq_along(vec), fun, vec)
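
    For the example vector this produces (note the NaN, rather than NA, in the first position):

    vec <- c(1, 4, 6, 3, 2, 7)
    sapply(seq_along(vec), fun, vec)
    ## [1]      NaN 1.000000 3.000000 4.500000 3.900000 3.266667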
    

    This is good enough -- NaN as the first element rather than NA, but that's easily corrected after the fact (or conceptually accepted as the right answer). It's also better than your solution, though still 'using a loop': the management of the result vector is done by sapply(), rather than in your loop where you have to manage it yourself. In particular, your 'copy and append' approach is very bad performance-wise, since it copies the entire existing result on each iteration. It's better to pre-allocate a result vector of the appropriate length with result = numeric(length(vec)) and then fill it with result[[i]] = ..., as sketched below, and better still to just let sapply() do the right thing for you!
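
    For reference, a minimal sketch of that pre-allocated variant of your loop (same result, no repeated copying):

    result <- numeric(length(vec))  # allocate the full result once
    result[[1]] <- NA               # no previous elements for the first position
    for (i in 2:length(vec)) {
        ## overwrite in place instead of copy-and-append
        result[[i]] <- weighted.mean(vec[seq_len(i - 1)], w = seq_len(i - 1))
    }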

    The problem is that the naive implementation scales quadratically -- you make a pass along vec to process each element, and then for each element you make a second pass to calculate the weighted mean, so there are n(n - 1)/2 calculations.
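
    A crude way to see this (timings are machine-dependent; the point is the ratio, roughly 4x when n doubles):

    x1 <- runif(2e4)
    x2 <- runif(4e4)
    system.time(sapply(seq_along(x1), fun, x1))
    system.time(sapply(seq_along(x2), fun, x2))  # expect roughly 4x the first timing

    So...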

    Take a look at weighted.mean

    > stats:::weighted.mean.default
    function (x, w, ..., na.rm = FALSE) 
    {
        ## SNIP -- edited for brevity
        w <- as.double(w)
        if (na.rm) {
            i <- !is.na(x)
            w <- w[i]
            x <- x[i]
        }
        sum((x * w)[w != 0])/sum(w)
    }
    

    and use cumsum() instead of sum(), so that both the numerator and the denominator become running totals rather than single sums, i.e., return a vector as long as x whose ith element is the weighted mean of the first i elements

    cumweighted.mean <- function(x, w) {
        ## handle NA values?
        w <- as.numeric(w)  # to avoid integer overflow
        ## no need for weighted.mean()'s [w != 0] subset here; applied to the
        ## numerator alone it would misalign it with cumsum(w) for zero weights
        cumsum(x * w) / cumsum(w)
    }
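
    A quick sanity check: the last element of the cumulative version agrees with an ordinary weighted.mean() over the full vector

    cumweighted.mean(vec, seq_along(vec))
    ## [1] 1.000000 3.000000 4.500000 3.900000 3.266667 4.333333
    weighted.mean(vec, seq_along(vec))
    ## [1] 4.333333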
    

    You'd like something a little different -- an NA in the first position, and at each position the weighted mean of the previous elements only, so shift everything along by one

    myweighted.mean <- function(x)
        c(NA, cumweighted.mean(head(x, -1), head(seq_along(x), -1)))
    

    This makes a single pass through the data, so scales linearly (at least in theory).
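
    Applied to the example vector, this reproduces the expected result (with a real NA this time):

    myweighted.mean(vec)
    ## [1]       NA 1.000000 3.000000 4.500000 3.900000 3.266667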