Tags: r, bigdata, reshape, ff, ffbase

Functions for creating and reshaping big data in R using the ff package


I'm new to R and the ff package, and I'm trying to better understand how ff lets users work with large datasets (>4 GB). I have spent a considerable amount of time searching the web for tutorials, but the ones I found generally go over my head.

I learn best by doing, so as an exercise I would like to create a long-format time-series dataset, similar to R's built-in "Indometh" dataset, using arbitrary values, then reshape it into wide format, and finally save the output as a CSV file.

With small datasets this is simple, and can be achieved using the following script:

##########################################
#Generate the data frame

DF<-data.frame()
for(Subject in 1:6){
  for(time in 1:11){
    DF<-rbind(DF,c(Subject,time,runif(1)))
  }
}
names(DF)<-c("Subject","time","conc")

##########################################
#Reshape to wide format

DF<-reshape(DF, v.names = "conc", idvar = "Subject", timevar = "time", direction = "wide")

##########################################
#Save csv file

write.csv(DF,file="DF.csv")

But I would like to learn to do this for file sizes of approximately 10 GB. How would I do this using the ff package? Thanks in advance.


Solution

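  • The function reshape does not exist for ffdf objects, but the same result is straightforward to get with the ffbase package. Use ffdfdply from ffbase, split by Subject, and apply reshape inside the function. ffdfdply works through the data in chunks that fit in RAM and keeps all rows sharing a split value in the same chunk, so every Subject's rows reach reshape together as an ordinary data frame.

    An example based on the Indometh dataset, with 1,000,000 subjects: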

    library(ffbase)
    library(datasets)
    data(Indometh)
    
    ## Generate some random data: one row per Subject x time combination
    x <- expand.ffgrid(Subject = ff(factor(1:1000000)), time = ff(unique(Indometh$time)))
    x$conc <- ffrandom(n=nrow(x), rfun = rnorm)
    dim(x)
    ## [1] 11000000        3
    
    ## and reshape to wide format
    ## FUN receives a plain data frame holding all rows for several Subjects
    result <- ffdfdply(x=x, split=x$Subject, FUN=function(datawithseveralsplitelements){
      df <- reshape(datawithseveralsplitelements, 
                  v.names = "conc", idvar = "Subject", timevar = "time", direction = "wide")
      as.data.frame(df)
    })
    class(result)
    ## [1] "ffdf"
    colnames(result)
    ## [1] "Subject"   "conc.0.25" "conc.0.5"  "conc.0.75" "conc.1"    "conc.1.25" "conc.2"    "conc.3"    "conc.4"    "conc.5"    "conc.6"    "conc.8"   
    dim(result)
    ## [1] 1000000      12
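
The example above stops before the last step in the question, saving the wide table as a CSV file. One way this might be done is with write.csv.ffdf from the ff package, which writes an ffdf to disk in chunks. The following is only a minimal sketch: it uses a small, made-up stand-in for result (the names wide, result_small, out_file and check are illustrative, not from the answer) so that it runs quickly, and it writes to a temporary directory.

```r
library(ff)

# Hypothetical toy stand-in for the wide-format `result`;
# the real object would have 1,000,000 rows and 12 columns.
wide <- data.frame(Subject = 1:6, conc.1 = runif(6), conc.2 = runif(6))
result_small <- as.ffdf(wide)

# write.csv.ffdf writes the ffdf to disk chunk by chunk, so the
# whole table does not have to be held in RAM at once.
out_file <- file.path(tempdir(), "DF_wide.csv")
write.csv.ffdf(result_small, file = out_file)

# Read the file back with base R to confirm the rows arrived.
check <- read.csv(out_file)
nrow(check)
```

For the full-size object the call would presumably be the same, for example write.csv.ffdf(result, file = "DF.csv"), with the resulting file size depending on the number of rows and the printed precision of the values.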