
Subsetting ffdf in loop


I'm trying to subset a very large ffdf object in a loop using ffbase, but I'm getting the error message:

Error in UseMethod("as.hi") : no applicable method for 'as.hi' applied to an object of
class "NULL"

I'm running this code on a remote server (over ssh) with a large amount of memory available. Here is the code I'm trying to run:

# totalD is an ffdf with columns ID, TS, and TD, each with 288,133,589 rows. ID consists
# of integers. TS is a column of integer timestamps with second precision. TD is of type
# double. Uid3 is an integer vector consisting of the 1205 unique entries of totalD$ID.

# H_times creates a matrix of the sum of the entries in TD traveled in each hour
H_times <- function(totalD, Uid3) {

    # hours is the number of unique hours of the experiment
    hours <- length(unique(subset(totalD$TS, totalD$TS %% 3600 == 0)))-1

    # bH is the first full-hour timestamp; it seeds the counter Bh used in the following loops
    bH <- min(unique(subset(totalD$TS, totalD$TS %% 3600 == 0)))

    # sum_D_matrix is the output
    sum_D_matrix <- matrix(0, nrow = hours, ncol = length(Uid3))

    for(i in 1:length(Uid3)) {
        Bh <- bH
        for(j in 1:hours) {
            sum_D_matrix[j,i] <- sum(subset(totalD$TD, totalD$TS >= Bh & totalD$TS < (Bh + 3600) & totalD$ID == Uid3[i]))
            Bh <- Bh + 3600
        }
    }
    save(sum_D_matrix, file = "sum_D_matrix")
}

H_times(totalD, Uid3)

I tried to implement the fix that jwijffels suggested in the comments of this question, but to no avail. Thanks in advance!


Solution

  • This is caused by the line:

    sum_D_matrix[j,i] <- sum(subset(totalD$TD, 
        totalD$TS >= Bh & totalD$TS < (Bh + 3600) & totalD$ID == Uid3[i]))
    

    Here the selection can be empty. One of the problems with ff is that it cannot handle empty vectors: the length of a vector/ffdf must always be >= 1. Perhaps this should be handled by subset.ff; however, it is unclear what subset.ff should return in that case.
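
    To see how an empty selection arises, here is a minimal base R sketch on made-up toy data (the data frame toy, its values, and the name window_start are invented for illustration and are not taken from the question). In plain R the empty case is harmless, because summing a zero-length vector gives 0; with ff the same selection has no valid representation.

```r
# Toy data, invented for illustration: two IDs, timestamps in seconds
toy <- data.frame(
  ID = c(1L, 1L, 2L),
  TS = c(0L, 3600L, 10L),
  TD = c(1.5, 2.0, 4.0)
)

window_start <- 3600L  # second hourly window: [3600, 7200)

# ID 2 has no rows in this window, so every element of sel is FALSE
sel <- toy$TS >= window_start &
  toy$TS < (window_start + 3600L) &
  toy$ID == 2L

n_selected <- sum(sel)      # number of matching rows
total <- sum(toy$TD[sel])   # base R sums the empty vector without complaint
```

    With 1205 IDs and many hourly windows, it is likely that at least one ID/hour combination has no rows, which is enough to trigger the error.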

    You can use the following work-around:

    sel <- totalD$TS >= Bh & totalD$TS < (Bh + 3600) & totalD$ID == Uid3[i]
    sel <- ffwhich(sel, sel)
    if (is.null(sel)) {
      sum_D_matrix[j,i] <- 0
    } else {
      sum_D_matrix[j,i] <- sum(totalD$TD[sel])
    }
    

    ffwhich returns NULL when the resulting vector would be empty (as mentioned above, it cannot return a vector of length 0).
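
    If the same guard is needed in several places, it could be wrapped in a small helper. The following is only a sketch: sum_where is a hypothetical name (it is not part of ff or ffbase), the toy vectors are invented, and it assumes ffbase is installed and that ffwhich behaves as described above.

```r
library(ffbase)  # also loads ff

# Hypothetical helper (not part of ffbase): sum the elements of the ff vector
# `values` for which the ff logical vector `cond` is TRUE, treating an empty
# selection as a sum of 0.
sum_where <- function(values, cond) {
  idx <- ffwhich(cond, cond)  # NULL when nothing is selected
  if (is.null(idx)) {
    return(0)
  }
  sum(values[idx])
}

# Toy ff vectors, invented for illustration
TD <- ff(c(1.5, 2.0, 4.0))
TS <- ff(c(0L, 3600L, 10L))

first_hour <- sum_where(TD, TS >= 0L & TS < 3600L)      # rows 1 and 3 match
third_hour <- sum_where(TD, TS >= 7200L & TS < 10800L)  # no rows match
```

    In the loop from the question, the body would then reduce to a single call such as sum_where(totalD$TD, totalD$TS >= Bh & totalD$TS < (Bh + 3600) & totalD$ID == Uid3[i]).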

    Side note

    The way you are using subset is actually a bit unusual. One of the reasons to use subset is to simplify the notation by getting rid of all the totalD$ prefixes. The more usual way of using it would be:

    sum_D_matrix[j,i] <- sum(subset(totalD, TS >= Bh & TS < (Bh + 3600) & ID == Uid3[i], 
        TD, drop=TRUE))