Grow an ffdf data frame on disk gradually


From documentation of save.ffdf:

Using ‘save.ffdf’ automagically sets the ‘finalizer’s of the ‘ff’ vectors to ‘"close"’. This means that the data will be preserved on disk when the object is removed or the R sessions is closed. Data can be deleted either using ‘delete’ or by removing the directory where the object were saved (‘dir’).

I want to start with a small ffdf data frame, add a little new data at a time, and grow it on disk. So I did a little experiment:

# in R
library(ffbase)  # also attaches ff
ffiris = as.ffdf(iris)
save.ffdf(ffiris, dir = "~/Desktop/iris")

# in bash
ls ~/Desktop/iris/
## ffiris$Petal.Length.ff ffiris$Petal.Width.ff  ffiris$Sepal.Length.ff ffiris$Sepal.Width.ff  ffiris$Species.ff

# in R
# add a new column
ffiris = transform(ffiris, new1 = rep(99, nrow(iris)))
rm(ffiris)

# in bash
ls ~/Desktop/iris/
## ffiris$Petal.Length.ff ffiris$Petal.Width.ff  ffiris$Sepal.Length.ff ffiris$Sepal.Width.ff  ffiris$Species.ff

It turns out it doesn't automatically update the ff data on disk when I remove ffiris. What about saving it manually?

# in R
# add a new column
ffiris = transform(ffiris, new1 = rep(99, nrow(iris)))
save.ffdf(ffiris, "~/Desktop/iris")

# in bash
ls ~/Desktop/iris/
## ffiris$Petal.Length.ff ffiris$Petal.Width.ff  ffiris$Sepal.Length.ff ffiris$Sepal.Width.ff  ffiris$Species.ff

Hmm, still no luck. Why?

What about removing the folder before saving?

# in R
ffiris = as.ffdf(iris)
unlink("~/Desktop/iris", recursive = TRUE, force = TRUE)
save.ffdf(ffiris, "~/Desktop/iris", overwrite = TRUE)
ffiris = transform(ffiris, new1 = rep(99, nrow(iris)))
unlink("~/Desktop/iris", recursive = TRUE, force = TRUE)
save.ffdf(ffiris, "~/Desktop/iris", overwrite = TRUE)

# in bash
ls ~/Desktop/iris/
# ls: /Users/ky/Desktop/iris/: No such file or directory

Even stranger. And even if all of this worked, it would still be terribly inefficient. I am looking for something like:

updateOnDisk(ffiris)

Could anyone help?


Solution

  • ff and ffbase offer out-of-memory R vectors, but they introduce reference semantics, which can cause problems with R idioms.

    R is a functional programming language, meaning that functions do not change their arguments but return modified copies. In ffbase we implement functions the R way, i.e. transform returns a copy of the original ffdf data.frame. This can be seen by looking at the filenames:

    ffiris = as.ffdf(iris)
    save.ffdf(ffiris, dir = "~/Desktop/iris")
    filename(ffiris) # show contents of ~/Desktop/iris
    
    ffiris = transform(ffiris, new1 = 99) # this creates a copy of the whole data.frame!
    filename(ffiris)  
    
    ffiris$new2 <- ff(rep(99, nrow(iris)))  # this creates a new column, but not yet in the right directory
    filename(ffiris)
    
    save.ffdf(ffiris, dir="~/Desktop/iris", overwrite=TRUE) # this fixes that.
    

    transform is currently an inefficient way to add a new column, because it copies the whole data frame (that is R semantics). The copy is made because the result of transform might be temporary, and you don't want to change the original data.
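
    The copy semantics are easiest to see on an ordinary in-memory data frame. The sketch below uses only base R; df and df2 are names made up for the illustration, and nothing in it touches ff files:

    ```r
    # A plain data.frame with one column.
    df <- data.frame(x = 1:3)

    # transform() returns a new data frame; df itself is left untouched.
    df2 <- transform(df, y = x * 2)

    names(df)   # "x"
    names(df2)  # "x" "y"
    ```

    With an ffdf the same rule applies, except that the copy also means new files on disk, which is why the saved directory does not change until save.ffdf is called again.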

    In ffbase2 we are fixing this issue.
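
Putting the pieces above together, a minimal sketch of growing an ffdf on disk could look like the following. It assumes ffbase is installed and that ffdfappend (for rows) and column assignment with an ff vector (for columns) behave as documented. The directory db_dir is a hypothetical location chosen for the example. Note that dir is passed by name: in the documented signature of save.ffdf it comes after `...`, so as far as I can tell an unnamed path would be treated as another object to save rather than as the target directory.

```r
library(ffbase)  # also attaches ff

# Hypothetical location for this sketch; any writable directory should do.
db_dir <- file.path(tempdir(), "iris_ff")

# Start small: the first 50 rows, with the ff files moved into db_dir.
ffiris <- as.ffdf(iris[1:50, ])
save.ffdf(ffiris, dir = db_dir, overwrite = TRUE)

# Grow by rows: append an ordinary data.frame chunk to the ffdf.
ffiris <- ffdfappend(ffiris, iris[51:150, ])

# Grow by columns: assign an ff vector instead of calling transform().
ffiris$new1 <- ff(rep(99, nrow(ffiris)))

# Save again so the new column's file is moved into db_dir and the
# stored description of ffiris matches its new shape.
save.ffdf(ffiris, dir = db_dir, overwrite = TRUE)

dim(ffiris)
list.files(db_dir)
```

In a later session, load.ffdf(db_dir) should bring ffiris back into the workspace. This is still not a single updateOnDisk() call, but it avoids copying the whole data frame on every step.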