
Multivariate cumulative sum


Suppose one wishes to calculate a cumulative sum based on a multivariate condition: for each row i of a multivariate grid x, sum R[j] over all rows j of Z for which all(Z[j, ] <= x[i, ]) holds. One can of course implement this naively

cSums <- numeric(nrow(x))
for(i in seq(nrow(x))){
   for(j in seq(nrow(Z))){
        if(all(Z[j, ] <= x[i, ]))
            cSums[i] <- cSums[i] + R[j] # <== R is a single vector to be summed
   }
}

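which costs on the order of nrow(x) * nrow(Z) * p comparisons for p columns, i.e. O(n^2 * p) when x and Z both have n rows, or slightly faster by iteratively narrowing the candidate rows one column at a time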

cSums <- numeric(nrow(x))
for(i in seq(nrow(x))){
    indx <- seq(nrow(Z))
    for(j in seq(ncol(Z))){
        indx <- indx[which(Z[indx, j] <= x[i, j])]
    }
    cSums[i] <- sum(R[indx])
}

but in the worst case this is still as slow as the naive implementation. How could one improve this to achieve better performance, while still allowing an arbitrary number of columns to be compared?

Dummy data and Reproducible example

var1 <- c(3,3,3,5,5,5,4,4,4,6)
var2 <- rep(seq(1,5), each = 2)
Z <- cbind(var1, var2)
x <- Z
R <- rep(1, nrow(x))
# Result using either method.
#[1] 2 2 3 4 6 6 5 5 6 10
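
To make the example easy to rerun, the second loop can be wrapped in a function and checked against the result quoted above. The name cumsum_loop is hypothetical, introduced only for this sketch; the data are the dummy data from this section.

```r
# Dummy data from the example above
var1 <- c(3, 3, 3, 5, 5, 5, 4, 4, 4, 6)
var2 <- rep(seq(1, 5), each = 2)
Z <- cbind(var1, var2)
x <- Z
R <- rep(1, nrow(x))

# Hypothetical wrapper around the column-by-column subsetting loop
cumsum_loop <- function(Z, x, R) {
  cSums <- numeric(nrow(x))
  for (i in seq_len(nrow(x))) {
    indx <- seq_len(nrow(Z))
    for (j in seq_len(ncol(Z))) {
      # keep only the candidate rows that still satisfy Z[, j] <= x[i, j]
      indx <- indx[Z[indx, j] <= x[i, j]]
    }
    cSums[i] <- sum(R[indx])
  }
  cSums
}

res <- cumsum_loop(Z, x, R)
print(res)
#[1]  2  2  3  4  6  6  5  5  6 10
```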

Solution

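  • We can use apply row-wise to compare every row with every row of Z (itself included) and count how many of them satisfy the criterion. This relies on Z being a data frame, as in the data below, so that the comparison with as.list(x) is done column by column.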

    apply(Z, 1, function(x) sum(rowSums(Z <= as.list(x)) == length(x)))
    #[1]  2  2  3  4  6  6  5  5  6 10
    

    A similar approach can also be written with sapply + split:

    sapply(split(Z, seq_len(nrow(Z))), function(x) 
                    sum(rowSums(Z <= as.list(x)) == length(x)))
    

    data

    var1 <- c(3,3,3,5,5,5,4,4,4,6)
    var2 <- rep(seq(1,5), each = 2)
    Z <- data.frame(var1, var2)
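
The snippets above count matching rows, which equals the requested sum only because R is all ones in the example, and they compare Z with itself rather than with a separate grid x. A minimal sketch of a variant that accepts matrices or data frames, a separate grid x, and arbitrary weights R might look as follows. The function name dominance_sum is an assumption made for illustration and is not part of the original answer; the cost is still proportional to nrow(x) * nrow(Z) * ncol(Z), only the inner loops are vectorised.

```r
# Hypothetical helper: for each row of x, sum R over the rows of Z
# that are <= that row of x in every column.
dominance_sum <- function(Z, x, R) {
  Z <- as.matrix(Z)
  x <- as.matrix(x)
  tZ <- t(Z)  # p x n: a length-p vector is recycled down each column
  vapply(seq_len(nrow(x)), function(i) {
    # column k of tZ is row k of Z; it qualifies if all p entries pass
    ok <- colSums(tZ <= x[i, ]) == nrow(tZ)
    sum(R[ok])
  }, numeric(1))
}

var1 <- c(3, 3, 3, 5, 5, 5, 4, 4, 4, 6)
var2 <- rep(seq(1, 5), each = 2)
Z <- data.frame(var1, var2)

unit_res <- dominance_sum(Z, Z, rep(1, nrow(Z)))
print(unit_res)
#[1]  2  2  3  4  6  6  5  5  6 10

# With non-constant weights the sum differs from a plain row count
weighted_res <- dominance_sum(Z, Z, seq_len(nrow(Z)))
print(weighted_res)
```

Transposing Z once lets each grid row be compared against all rows of Z in a single vectorised comparison, which avoids the as.list trick and therefore works for plain matrices as well.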