r, postgresql, bigdata, plr

When running PL/R on PostgreSQL, can R handle data bigger than RAM?


When I use open-source R, it's not possible to handle data sets bigger than RAM without using a specialized package. So I would like to know whether it's possible to handle big data sets by applying PL/R functions inside PostgreSQL.

I couldn't find any documentation about this.


Solution

  • As mentioned by Hong Ooi, PL/R loads an R interpreter into the PostgreSQL backend process. So your R code is running "in database".

    There is no universal way to deal with memory limitations, but there are at least two possible options:

    1. Define a custom PostgreSQL aggregate, and use your PL/R function as the "final" function. In this way you are processing in groups, and thus less likely to have problems with memory (a rough sketch follows after this list). See the online PostgreSQL documentation and PL/R documentation for more detail (I don't post to stackoverflow often, so unfortunately it will not allow me to post the actual URLs for you)
    2. Use the pg.spi.cursor_open and pg.spi.cursor_fetch functions installed by PL/R into the R interpreter to page data into your R function in chunks (see the second sketch below).
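
    Here is a rough sketch of option 1, adapted from the median example in the PL/R documentation. plr_array_accum is the array-accumulating transition function that ships with PL/R; the table and column names (big_table, grp, val) are placeholders.

    ```sql
    -- Final function written in PL/R: it receives the accumulated float8[]
    -- for ONE group and returns its median, so only a single group's values
    -- are materialized in the R interpreter at a time.
    CREATE OR REPLACE FUNCTION r_median_final(float8[]) RETURNS float8 AS $$
      median(arg1)
    $$ LANGUAGE plr;

    CREATE AGGREGATE r_median (float8) (
      sfunc     = plr_array_accum,   -- array accumulator installed by PL/R
      stype     = float8[],
      finalfunc = r_median_final
    );

    -- Usage: aggregate per group instead of pulling the whole table into R
    SELECT grp, r_median(val) FROM big_table GROUP BY grp;
    ```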

    See PL/R docs here: http://www.joeconway.com/plr/doc/index.html
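
    And a rough sketch of option 2, using the cursor support functions described in those docs. The cursor name, chunk size, table, and column names are made up for illustration; check the docs for the exact argument defaults.

    ```sql
    CREATE OR REPLACE FUNCTION chunked_mean() RETURNS float8 AS $$
      # Prepare a plan and open a cursor over the (potentially huge) table.
      plan   <- pg.spi.prepare("SELECT val FROM big_table")
      cursor <- pg.spi.cursor_open("plr_chunk_cursor", plan)

      total <- 0
      n     <- 0
      repeat {
        # Fetch the next chunk as a data.frame; only this chunk is held in RAM.
        chunk <- pg.spi.cursor_fetch(cursor, TRUE, as.integer(100000))
        if (is.null(chunk) || nrow(chunk) == 0) break
        total <- total + sum(chunk$val)
        n     <- n + nrow(chunk)
      }
      pg.spi.cursor_close(cursor)

      total / n   # running mean computed without loading the whole table
    $$ LANGUAGE plr;
    ```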

    I am guessing what you would really like to have is a data.frame in which the data is paged to and from an underlying database cursor transparently to your R code. This is on my long term TODO, but unfortunately I have not been able to find the time to work it out. I have been told that Oracle's R connector has this feature, so it seems it can be done. Patches welcomed ;-)