Tags: r, dplyr, parquet, apache-arrow

Random sampling of parquet prior to collect


I want to randomly sample a dataset. If I already have that dataset loaded, I can do something like this:

library(dplyr)

set.seed(-1)

mtcars %>% slice_sample(n = 3)
#               mpg cyl  disp  hp drat    wt qsec vs am gear carb
# AMC Javelin  15.2   8 304.0 150 3.15 3.435 17.3  0  0    3    2
# Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
# Merc 240D    24.4   4 146.7  62 3.69 3.190 20.0  1  0    4    2

But my dataset is stored as a parquet file. As an example, I'll create a parquet from mtcars:

library(arrow)

# Create parquet file
write_dataset(mtcars, "~/mtcars", format = "parquet")

open_dataset("~/mtcars") %>% 
  slice_sample(n = 3) %>% 
  collect()
  
# Error in UseMethod("slice_sample") : 
#   no applicable method for 'slice_sample' applied to an object of class "c('FileSystemDataset', 'Dataset', 'ArrowObject', 'R6')"

Clearly, slice_sample isn't implemented for Arrow Datasets, and neither is slice:

open_dataset("~/mtcars") %>% nrow() -> n

subsample <- sample(1:n, 3)

open_dataset("~/mtcars") %>% 
  slice(subsample) %>% 
  collect()

# Error in UseMethod("slice") : 
#   no applicable method for 'slice' applied to an object of class "c('FileSystemDataset', 'Dataset', 'ArrowObject', 'R6')"

Now, I know filter is implemented, so I tried that:

open_dataset("~/mtcars") %>% 
  filter(row_number() %in% subsample) %>% 
  collect()

# Error: Filter expression not supported for Arrow Datasets: row_number() %in% subsample
# Call collect() first to pull data into R.

(This also doesn't work if I create a logical filtering vector first, e.g., foo <- rep(FALSE, n); foo[subsample] <- TRUE, and use that in filter; see the sketch below.)
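
For completeness, that attempt looks like this; it fails in much the same way, presumably because arrow can only inline scalar values from the calling environment, not a full-length vector:

foo <- rep(FALSE, n)
foo[subsample] <- TRUE

open_dataset("~/mtcars") %>% 
  filter(foo) %>% 
  collect()

# Fails with a similar "not supported for Arrow Datasets" error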

This error offers some helpful advice, though: collect the data and then subsample. The issue is that the file is ginormous, so much so that collecting it crashes my session.
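
To be clear, the collect-first route is fine for data that fits in memory; it's only the scale that rules it out here:

# Pull everything into R, then sample; only viable when the data fits in memory
open_dataset("~/mtcars") %>% 
  collect() %>% 
  slice_sample(n = 3)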


Question: is there a way to randomly subsample a parquet file before loading it with collect?


Solution

  • It turns out that there is an example in the documentation that pretty much fulfils my goal. That example is a smidge dated, as it uses sample_frac, which has since been superseded by slice_sample, but the general principle holds, so I've updated it here. As I don't know in advance how many batches there will be, I show how it can be done with a proportion, as Pace suggested, instead of pulling a fixed number of rows.

    One issue with this approach is that (as far as I understand) it still requires reading the entire dataset; it just does so in batches rather than all in one go.

    # Sample 10% of the rows from each batch as it streams in
    open_dataset("~/mtcars") %>%
      map_batches(~ as_record_batch(slice_sample(as.data.frame(.), prop = 0.1))) %>%
      collect()
    
    #    mpg cyl disp  hp drat    wt  qsec vs am gear carb
    # 1 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    # 2 14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
    # 3 15.8   8  351 264 4.22 3.170 14.50  0  1    5    4
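
    If I need an exact sample size rather than a proportion, one option (my own extension of the example, not something from the docs) is to oversample slightly per batch and then trim in memory:

    # Oversample per batch, then cut the (now small) result down to an exact n
    sampled <- open_dataset("~/mtcars") %>%
      map_batches(~ as_record_batch(slice_sample(as.data.frame(.), prop = 0.2))) %>%
      collect()

    # prop must be generous enough that sampled has at least n rows
    sampled %>% slice_sample(n = 3)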