Tags: r, bigdata, apply, ff

Using apply on large ffdfs


The basic idea is this: I have a large ffdf (about 5.5 million rows x 136 columns). I know for a fact that some of the columns in this data frame are entirely NA. How do I find out which ones they are and remove them appropriately?

My instinct is to do something like (assuming df is the ffdf):

apply(X=is.na(df[,1:136]), MARGIN = 2, FUN = sum)

which should give me a vector of the NA counts for each column; I could then find which ones have ~5.5 million NA values, remove them using df <- df[,-c(vector of columns)], etc. Pretty straightforward.
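
For example, on a small in-memory data frame the whole workflow would look like this (a toy sketch with made-up data, not the actual ffdf):

# toy stand-in for the real data: column b is entirely NA
small <- data.frame(a = 1:3, b = NA, c = c(NA, 2, 3))

na_counts <- apply(X = is.na(small), MARGIN = 2, FUN = sum)
small <- small[, na_counts < nrow(small)]  # drops column b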

However, apply gives me an error.

Error: cannot allocate vector of size 21.6 Mb
In addition: Warning messages:
1: In `[.ff`(p, i2) :
  Reached total allocation of 3889Mb: see help(memory.size)
2: In `[.ff`(p, i2) :
  Reached total allocation of 3889Mb: see help(memory.size)
3: In `[.ff`(p, i2) :
  Reached total allocation of 3889Mb: see help(memory.size)
4: In `[.ff`(p, i2) :
  Reached total allocation of 3889Mb: see help(memory.size)

This tells me that apply can't handle a data frame of this size. Are there any alternatives I can use?


Solution

  • It is easier to use all(is.na(column)). sapply/lapply do not work here because an ffdf object is not a list.

    You use df[, 1:136] in your code. This makes ff read all 136 columns into memory, and that is what causes the memory errors. It does not happen when you use df[1:136], which selects the columns without reading their data. The same applies when indexing the final result: df <- df[,-c(vector of columns)] reads all the selected columns into memory, so use list-style indexing there as well, as in the code below.

    # Check the columns one at a time: df[[i]] refers to a single
    # ff vector, so at most one column is processed per iteration
    na_cols <- logical(136)
    for (i in seq_len(136)) {
      na_cols[i] <- all(is.na(df[[i]]))
    }
    
    # List-style indexing drops the all-NA columns without reading
    # the remaining columns into memory
    res <- df[!na_cols]
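
    To see the difference between the two indexing styles, compare the classes of the results (a sketch, assuming df is the ffdf from the question):

    # matrix-style indexing materialises the selection as an ordinary
    # data.frame in RAM -- this is what exhausted memory above
    class(df[, 1:5])   # "data.frame"

    # list-style indexing returns a new ffdf whose data stay on disk
    class(df[1:5])     # "ffdf"

    Looping over the columns also keeps peak memory usage small: a single column of 5.5 million doubles is about 5.5e6 * 8 bytes ≈ 44 MB, well under the 3889 Mb allocation limit reported in the error. If you prefer the apply family, the loop can also be written as na_cols <- vapply(seq_len(136), function(i) all(is.na(df[[i]])), logical(1)), since iterating over column indices avoids treating the ffdf itself as a list.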