Tags: r, matrix, memory, large-data

Storage-less and just-in-time-calculated vectors and matrices in R


One feature that I dearly wish R had is virtual vectors and matrices, which have no storage but are actually interfaces to functions which calculate their members just in time.

One application of such a feature is hierarchically clustering a large number of items whose distance metric is cheap to calculate. Currently, dist stores all the distances in memory, so I can use it to categorize only up to about 20,000 items. But I wish to do more. Since hclust allows the user to provide dist matrices, such a feature would let me work around the memory limits.

A related but less general feature is file-mapped vectors and matrices, which would use files as virtual memory.

Is there a package that does this? Would it be simple to write one? If I wish to implement this myself, where should I start looking?


Solution

  • R does have some facilities for something that looks like a variable but is a function behind the scenes (called an active binding):

    > makeActiveBinding("rand.x", function(...) rnorm(1,100,10), .GlobalEnv)
    > rand.x
    [1] 94.1004
    > rand.x
    [1] 109.3716
    

    (but be very careful using this: it leads to obscure code and hard-to-trace bugs)

    You could also create an object type with a subsetting method that calculated on the fly as mentioned by @Peyton.
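    One way to do that is an S3 class whose `[` method computes entries just in time instead of reading from storage. This is only a sketch; the class name `lazydist` and its `data`/`metric` fields are invented for illustration, not a standard API:

    ```r
    # A "virtual" distance matrix: an S3 class whose subset method
    # computes distances on demand instead of storing all n*(n-1)/2 values.
    lazydist <- function(data, metric = function(a, b) sqrt(sum((a - b)^2))) {
      structure(list(data = data, metric = metric), class = "lazydist")
    }

    "[.lazydist" <- function(x, i, j) {
      # Compute the distance between rows i and j just in time.
      x$metric(x$data[i, ], x$data[j, ])
    }

    m <- matrix(rnorm(100 * 5), nrow = 100)
    d <- lazydist(m)
    d[3, 17]   # computed on the fly; nothing is stored
    ```

    As noted below, though, passing such an object to an existing clustering function will not help if that function coerces its argument to an ordinary matrix first.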

    And as was pointed out in the comments there are packages like ff that will store a large data object on disk, but let you access pieces as if it were in memory.
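    For the file-mapped route, a minimal sketch using ff (assuming the ff package is installed; the dimensions here are arbitrary) might look like:

    ```r
    library(ff)  # file-backed arrays; install.packages("ff") if missing

    # Create a disk-backed double matrix; only the chunks you access
    # are paged into RAM, so the full object never has to fit in memory.
    big <- ff(vmode = "double", dim = c(5000, 5000))
    big[1, 2] <- 3.5
    big[1, 2]        # read back from the file-backed store
    ```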

    However, none of these is likely to accomplish what you describe. Functions like agnes in the cluster library pass their arguments to other functions like data.table or as.dist, which will effectively make a copy of your object (or of the result of running the function once). So ff objects will be loaded fully into memory, and active bindings or [ methods would be called once up front to create the entire matrix; a copy of that matrix would then be used from that point on.

    If you really want this functionality (and I can certainly see uses for it), a better approach would be to rewrite the clustering (or other) function to accept a function instead of a data or distance matrix. You can start with the existing function, strip out the parts that are no longer needed, and change the parts that extract values from the data into calls to the function that provides the distance information.
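    To make the idea concrete, here is a deliberately naive sketch (not the actual cluster/hclust code) of single-linkage agglomeration that calls a user-supplied distance *function* on demand, so memory use stays proportional to the number of items rather than the number of pairs:

    ```r
    # Single-linkage clustering driven by a distance function distfun(i, j)
    # instead of a precomputed dist object. O(n^2) distance calls per merge,
    # so it is slow, but it never materializes the full distance matrix.
    lazy_single_linkage <- function(n, distfun, k = 1) {
      clusters <- as.list(seq_len(n))        # start with singletons
      while (length(clusters) > k) {
        best <- c(1, 2); best_d <- Inf
        for (a in seq_along(clusters)) {
          for (b in seq_along(clusters)) {
            if (a >= b) next
            # single linkage: minimum pairwise distance between the two
            # clusters, computed just in time from distfun
            d <- min(outer(clusters[[a]], clusters[[b]],
                           Vectorize(distfun)))
            if (d < best_d) { best_d <- d; best <- c(a, b) }
          }
        }
        clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
        clusters[[best[2]]] <- NULL
      }
      clusters
    }

    # Two tight pairs of points; with k = 2 they form two clusters.
    m <- matrix(c(0, 0, 0.1, 0, 10, 10, 10.1, 10), ncol = 2, byrow = TRUE)
    distfun <- function(i, j) sqrt(sum((m[i, ] - m[j, ])^2))
    lazy_single_linkage(nrow(m), distfun, k = 2)
    ```

    A production rewrite would keep the efficiency tricks of the original code (e.g. nearest-neighbor chains) and only swap the matrix lookups for function calls.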