Reading csv files in chunks with `readr::read_csv_chunked()`


I want to read large csv files but run into memory problems, so I would like to try reading them in chunks with read_csv_chunked() from the readr package. My problem is that I do not really understand the callback argument.

This is a minimal example of what I have tried so far (I know I would have to include the desired operations in f(), otherwise there would be no advantage in terms of memory usage, right?):

library(tidyverse)
data(diamonds)
write_csv(diamonds, "diamonds.csv") # to have a csv to read

f <- function(x) {x}
diamonds_chunked <- read_csv_chunked("diamonds.csv", 
                                     callback = DataFrameCallback$new(f),
                                     chunk_size = 10000)

I tried to keep the callback argument close to the example from the official documentation:

# Cars with 3 gears
f <- function(x, pos) subset(x, gear == 3)
read_csv_chunked(readr_example("mtcars.csv"), 
                 DataFrameCallback$new(f), 
                 chunk_size = 5)

However, I receive the error below. It seems to appear after the first chunk has been read, since I see the progress bar move to 18%.

Error in eval(substitute(expr), envir, enclos) : unused argument (index)

I already tried to include the manipulations that I want to make inside f(), but I still got the same error.


Solution

  • I figured out that the function passed to DataFrameCallback$new() always needs a second argument (pos in the example from the documentation). This argument does not have to be used inside the function, so I do not really understand its purpose. But at least the code works this way.

    Does anyone know more details about this second argument?
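
For illustration, here is a minimal sketch of the fixed example from the question. As far as I can tell from the readr documentation, the second argument receives the position of the chunk, that is, the row number of its first row, but please treat that reading as an assumption. The filter on cut == "Ideal" and the temporary file path are only placeholders for whatever operation and file you actually need.

```r
library(readr)
library(ggplot2)  # only needed for the diamonds data set

# Write the example data to a temporary csv file so there is something to read
path <- tempfile(fileext = ".csv")
write_csv(diamonds, path)

# The callback takes the chunk (x) and a second argument (pos).
# pos is assumed to be the row number of the first row in the chunk.
# Filtering inside the callback means only the matching rows are kept.
f <- function(x, pos) {
  message("Processing chunk starting at row ", pos)
  subset(x, cut == "Ideal")
}

# DataFrameCallback row-binds the results of f() over all chunks
ideal <- read_csv_chunked(path,
                          callback = DataFrameCallback$new(f),
                          chunk_size = 10000,
                          progress = FALSE)
```

Because the filtering happens inside f(), each chunk is reduced before the results are combined, which is where the memory saving would come from. Even if pos is never used, it has to be present in the function signature, since the callback is called with two arguments.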