
Does rsample::bootstraps store data rather than just row indices?


I'm trying to understand why the rsample::bootstraps function apparently stores the entire data set for each bootstrap sample. I was expecting the function would just store the dataset once, along with the bootstrap indices for each resample. In the following you can see the basic structure, which is repeated for each resample:

> set.seed(1)
> test <- rsample::bootstraps(mtcars[, 1:3], times = 2)
> str(test)
bootstraps [2 × 2] (S3: bootstraps/rset/tbl_df/tbl/data.frame)
 $ splits:List of 2
  ..$ :List of 4
  .. ..$ data  :'data.frame':   32 obs. of  3 variables:
  .. .. ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
  .. .. ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
  .. .. ..$ disp: num [1:32] 160 160 108 258 360 ...
  .. ..$ in_id : int [1:32] 25 4 7 1 2 29 23 11 14 18 ...
  .. ..$ out_id: logi NA
  .. ..$ id    : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
  .. .. ..$ id: chr "Bootstrap1"
  .. ..- attr(*, "class")= chr [1:2] "rsplit" "boot_split"
  ..$ :List of 4
  .. ..$ data  :'data.frame':   32 obs. of  3 variables:
  .. .. ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
  .. .. ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
  .. .. ..$ disp: num [1:32] 160 160 108 258 360 ...
  .. ..$ in_id : int [1:32] 25 12 15 1 20 3 6 10 10 6 ...
  .. ..$ out_id: logi NA
  .. ..$ id    : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
  .. .. ..$ id: chr "Bootstrap2"
  .. ..- attr(*, "class")= chr [1:2] "rsplit" "boot_split"
 $ id    : chr [1:2] "Bootstrap1" "Bootstrap2"
 - attr(*, "times")= num 2
 - attr(*, "apparent")= logi FALSE
 - attr(*, "strata")= logi FALSE

The $data item appears to be repeated for each additional resample, while the resample indices, which do vary, are stored in in_id. The obvious cost is that the size of the object grows in proportion to the data size times the number of resamples. object.size() reports 7800 bytes for a single resample and 1236824 bytes for 200 resamples.


Solution

  • The data is not repeated for each resample; you can see an example of this in the README for the rsample package. The original data is never modified, so R does not make a copy.

    There is some memory overhead for each resample, and mtcars is a bit too small to show this well, so let's look at a bigger dataset, the Ames housing data (see the README for a different example):

    library(rsample)
    library(lobstr)
    data(ames, package = "modeldata")
    
    obj_size(ames)
    #> 1,042,736 B
    
    set.seed(123)
    boots <- bootstraps(ames, times = 50)
    obj_size(boots)
    #> 1,670,760 B
    
    ## what is the object size per resample?
    obj_size(boots)/nrow(boots)
    #> 33,415.2 B
    
    ## what is the relative size of the bootstrap object compared to the original?
    as.numeric(obj_size(boots)/obj_size(ames))
    #> [1] 1.602285
    

    Created on 2021-02-17 by the reprex package (v1.0.0)

    Much, much less than 50! With 50 resamples, the bootstrap object is only about 1.6 times the size of the original dataset in memory.
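
    If you want to check the sharing directly, something like the following should work. This is only a sketch: it assumes the splits keep the `data` and `in_id` elements shown in the `str()` output in the question, and `first` and `second` are names made up for illustration.

    ```r
    library(rsample)
    library(lobstr)

    set.seed(123)
    boots <- bootstraps(mtcars, times = 50)

    # Every split has its own `data` element (assumption: the internal layout
    # matches the str() output in the question)
    first  <- boots$splits[[1]]$data
    second <- boots$splits[[2]]$data

    # If both splits point at the same data frame in memory, the addresses match
    obj_addr(first) == obj_addr(second)

    # obj_size() counts shared memory once, so two splits measured together
    # should cost far less than twice one split
    obj_size(boots$splits[[1]])
    obj_size(boots$splits[[1]], boots$splits[[2]])
    ```

    Each split should add little more than its own `in_id` vector on top of the shared data.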

    Notice that I used lobstr to compare the sizes instead of object.size(). That's because object.size() does not detect when elements of a list are shared, and it does not include the size of environments, so it is less accurate overall. Here it counts the data once per resample, which is why the numbers in the question grow so fast. If you've ever tried to measure the memory used by objects in R with object.size() and been confused that it didn't match what your OS was saying, this is probably why. Using lobstr::obj_size() can solve this problem.
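
    The difference between the two functions is easy to reproduce without rsample. Here is a minimal sketch, assuming lobstr is installed; `x` and `shared` are names made up for illustration:

    ```r
    library(lobstr)

    # One numeric vector of a million doubles, roughly 8 MB
    x <- runif(1e6)

    # A list that holds the same vector 10 times; the elements are references
    # to one vector, not 10 separate copies
    shared <- rep(list(x), 10)

    # object.size() adds up every element as if it were stored separately
    object.size(shared)

    # obj_size() counts the shared vector only once
    obj_size(shared)
    ```

    The first number should come out at roughly ten times the second, which is the same kind of gap as between the question's object.size() results and the obj_size() results above.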