Search code examples
rbigdatasampling

Bootstrap Sampling in R on large data (too large to fit in RAM)


Is it possible to draw bootstrap samples from an .Rdata object, or any other large data object stored on disk? My current approach for sampling from very large data is to build a local MySQL database, and then draw random samples into R using SQL. Unfortunately, sampling and sorting in MySQL is not efficient at all. I'm wondering if anyone has engineered a better solution for this use case.

To get a sense of my current solution, see this question on sampling in MySQL: Simple Random Samples from a Sql database


Solution

  • General remarks

    You don't have to load all data to sample, just the ids of the rows, and sample from the ids. Then load data for the sampled rows only. In more detail:

    1. E.g. if you have a column called ID in your database, then load only this column. This should be fast, especially if ID is a single integer. Even if you have (say) 2 billion records, you only need 8GB memory for storing 2 billion integers, so this should be possible.

    2. Then sample from these ids.

    3. Then load only the records with the sampled ids.

    Just think about it. If you want to buy three random books from the online bookstore Siren, what do you do? Siren has millions of books, you can't order all them, and then choose randomly between them, and send back the rest, right? So what you do is, you ask Siren for a list of book ids (ISBN will do), this is of a size that fits into a modest computer. You choose three from the list randomly, and order these three from Siren.

    Rdata files

    This obviously does not work for .Rdata files, but .Rdata files are hopeless anyway, because you cannot load a piece of an .Rdata file. So you would need some format that is indexed.

    sqlite

    But you could use sqlite and the RSQLite package. RSQLite supports binding with data frames, so to load the sampled data all you need is to put the sampled ids in a data frame called samp_ids (single column named id), and then say something like

    ...
    my_samp <- dbGetQuery(con, "SELECT * FROM mytable WHERE id = :id", 
                          bind.data = samp_ids)
    ...
    

    This will read the records with the sampled ids.

    MySQL

    Last time I checked RMySQL did not support binding, so that is probably not the way to go. If you insist on MySQL, then the first thing I would try is creating a temporary table, with only the sampled row ids in it, and then JOIN this table with your data table. This is supported by RMySQL, it works like this:

    dbSendQuery(con, "CREATE TEMPORARY TABLE tmp (id INTEGER);")
    
    ## Stupid, but this seems to be necessary to really create the
    ## temporary table, it is an RMySQL bug that I reported long time
    ## ago: https://github.com/jeffreyhorner/RMySQL/issues/10
    try(dbSendQuery(con, "CREATE TEMPORARY TABLE tmp (id INTEGER);"))
    
    dbWriteTable(con, "tmp", samp_ids, row.names=FALSE, append=TRUE)
    

    Then you can write your SELECT to JOIN the temporary table tmp and your original data table.