
R - ff package: find the most frequent element in an ffdf and delete the rows where it occurs


I need a way to find the most frequent element in an ffdf and then delete every row in which it occurs. I decided to try the ff package because I'm working with very large data and I run out of memory with base R.

Here is a little example:

 # create a base R Matrix

 > z<-matrix(c("a", "b", "a", "c", "b", "b", "c", "c", "b", "a"),nrow=5,ncol=2,byrow = TRUE)
 > z


     [,1] [,2]
 [1,] "a"  "b" 
 [2,] "a"  "c" 
 [3,] "b"  "b" 
 [4,] "c"  "c" 
 [5,] "b"  "a" 


 # convert z to ffdf

 > u=as.data.frame(z, stringsAsFactors=TRUE)
 > u=as.ffdf(u)
 > u

  ffdf data
   V1 V2
1  a  b
2  a  c
3  b  b
4  c  c
5  b  a

I'm looking for a way to:

  • Extract the most frequent element in the ffdf (in this case it is "b")
  • Delete from the ffdf all the rows that contain "b"

So the new ffdf should look like this:

   V1 V2
1  a  c
2  c  c

In base R I found a way to do this with the "table" function:

  temp <- table(as.vector(z))            # frequency of every element
  t1 <- names(temp)[temp == max(temp)]   # most frequent element(s)
  z1 <- z[rowSums(z == t1[1]) == 0, ]    # keep the rows that do not contain it
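
For reference, the same steps can be wrapped in a small helper so that each stage is visible. This is only a sketch: the function name drop_most_frequent is made up for illustration, ties are broken by taking the first name in table order, and drop = FALSE is added on the assumption that the result should stay a matrix even when a single row survives.

```r
# Hypothetical helper (name invented for this illustration), base R only.
drop_most_frequent <- function(m) {
  counts <- table(as.vector(m))                   # frequency of every element
  top <- names(counts)[counts == max(counts)][1]  # first of the tied maxima
  keep <- rowSums(m == top) == 0                  # rows that never contain `top`
  list(most_frequent = top, rows = m[keep, , drop = FALSE])
}

z <- matrix(c("a", "b", "a", "c", "b", "b", "c", "c", "b", "a"),
            nrow = 5, ncol = 2, byrow = TRUE)
res <- drop_most_frequent(z)
res$most_frequent  # "b" (counts are a: 3, b: 4, c: 3)
res$rows           # the two rows ("a", "c") and ("c", "c")
```

This still holds the whole matrix in memory, which is exactly the limitation described next.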

But working with huge data I need something like the ff package.


Solution

  • require(ff)
    z <- matrix(c("a","b","f","c","f","b","e","c","b","e"),nrow=5,ncol=2,byrow = TRUE)
    u <- as.data.frame(z, stringsAsFactors=TRUE)
    u <- as.ffdf(u)
    u
    

    The following should work on a dataset of any size. It uses table.ff and ffwhich from ffbase, ffrowapply from ff, and indexing based on ff integer vectors.

    require(ffbase)
    require(plyr)
    ## Detect most frequent item (assuming the levels of all columns can be different)
    columnfreqs <- lapply(colnames(u), FUN=function(column) table(u[[column]]))
    columnfreqs <- lapply(columnfreqs, FUN=function(x) as.data.frame(t(as.matrix(x))))
    itemfreqs <- colSums(do.call(rbind.fill, columnfreqs), na.rm=TRUE)
    mostfrequent <- names(sort(itemfreqs, decreasing = TRUE))[1]
    
    ## Flag the rows of the ffdf that contain the most frequent item (processed chunk by chunk)
    idx <- ffrowapply(
      EXPR = apply(u[i1:i2,], MARGIN=1, FUN=function(row) any(row %in% mostfrequent)), 
      X=u, 
      RETURN = TRUE, FF_RETURN = TRUE, RETCOL = NULL, VMODE = "logical")
    idx <- ffwhich(idx, idx != TRUE) # keep only the rows without it + convert logicals to integer row positions
    
    ## Drop the flagged rows by selecting only the remaining positions
    u[idx, ]
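
To make the chunking idea more concrete, here is a rough base R sketch of the same two passes: count the items chunk by chunk, then flag the rows chunk by chunk. It only imitates what ffrowapply does with its i1:i2 row ranges and does not use ff itself. The helper names chunk_ranges and chunked_filter and the chunk_size argument are made up for this illustration, and an ordinary data frame stands in for the ffdf.

```r
# Base R stand-in for chunked processing; these helpers are not part of ff or ffbase.
chunk_ranges <- function(n, chunk_size) {
  starts <- seq(1, n, by = chunk_size)
  cbind(i1 = starts, i2 = pmin(starts + chunk_size - 1, n))
}

chunked_filter <- function(df, chunk_size = 2) {
  ranges <- chunk_ranges(nrow(df), chunk_size)

  # Pass 1: accumulate item counts one chunk at a time.
  counts <- integer(0)
  for (k in seq_len(nrow(ranges))) {
    chunk <- df[ranges[k, "i1"]:ranges[k, "i2"], , drop = FALSE]
    tab <- table(unlist(lapply(chunk, as.character)))
    items <- names(tab)
    counts[setdiff(items, names(counts))] <- 0L   # register unseen items
    counts[items] <- counts[items] + as.vector(tab)
  }
  # Ties go to the item that was seen first.
  mostfrequent <- names(counts)[which.max(counts)]

  # Pass 2: flag the rows that contain the most frequent item.
  hit <- logical(nrow(df))
  for (k in seq_len(nrow(ranges))) {
    rows <- ranges[k, "i1"]:ranges[k, "i2"]
    chunk <- as.matrix(df[rows, , drop = FALSE])
    hit[rows] <- apply(chunk, 1, function(row) any(row == mostfrequent))
  }
  list(mostfrequent = mostfrequent, kept = df[!hit, , drop = FALSE])
}

# Usage on the data from the question
u <- as.data.frame(
  matrix(c("a", "b", "a", "c", "b", "b", "c", "c", "b", "a"),
         nrow = 5, ncol = 2, byrow = TRUE),
  stringsAsFactors = TRUE)
res <- chunked_filter(u, chunk_size = 2)
res$mostfrequent  # "b"
res$kept          # rows 2 and 4 of u: (a, c) and (c, c)
```

Only one chunk is converted to character at a time, which is the property that lets the ff version scale. In the real solution the counts come from table.ff and the logical flags are stored in an ff vector rather than in RAM.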