Is there a way to stop readr::read_tsv_chunked() after a certain number of chunks?


I'm trying to use read_tsv_chunked() on a large .tsv file, and would like to stop after a certain number of chunks.

@jimhester has suggested a useful approach for interactively viewing a given chunk with browser(): https://github.com/tidyverse/readr/issues/848#issuecomment-388234659, but I'd like to write a function that 1) returns just the chunk of interest; and 2) stops reading the file after returning that chunk.

I've modified Jim's response to return the chunk so that I can use it with a DataFrameCallback, but can't figure out how to stop the read from within read_tsv_chunked().

My approach so far:

library(readr)

get_problem_chunk <- function(num) {
  i <- 1
  function(x, pos) {
    if (i == num) {
      i <<- i + 1
      return(x)
    }
    i <<- i + 1
    message(pos) # to see that it's scanning the whole file
    return(NULL) # break() or error() cause errors
  }
}

write_tsv(mtcars, "mtcars.tsv")
read_tsv_chunked("mtcars.tsv", DataFrameCallback$new(get_problem_chunk(3)), chunk_size = 3)

As you can see, that returns the chunk I want, but it doesn't stop reading until the callback has received every chunk:

> read_tsv_chunked("mtcars.tsv", DataFrameCallback$new(get_problem_chunk(3)), chunk_size = 3)
Parsed with column specification:
cols(
  mpg = col_double(),
  cyl = col_integer(),
  disp = col_integer(),
  hp = col_integer(),
  drat = col_double(),
  wt = col_double(),
  qsec = col_double(),
  vs = col_integer(),
  am = col_integer(),
  gear = col_integer(),
  carb = col_integer()
)
1
4
<I WANT IT TO STOP HERE, BUT DON'T KNOW HOW>
10
13
16
19
22
25
28
31
# A tibble: 3 x 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <int> <int> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
1  14.3     8   360   245  3.21  3.57  15.8     0     0     3     4
2  24.4     4    NA    62  3.69  3.19  20       1     0     4     2
3  22.8     4    NA    95  3.92  3.15  22.9     1     0     4     2

Solution

  • @jimhester to the rescue again - https://github.com/tidyverse/readr/issues/851#issuecomment-388929640

    You can do this by using the SideEffectCallback (which is the default when passed a normal function) and returning the results using the <<- operator. The SideEffectCallback stops reading when the callback function returns FALSE. e.g.

    library(readr)
    
    get_problem_chunk <- function(num) {
      i <- 1
      function(x, pos) {
        if (i == num) {
          res <<- x
          return(FALSE)
        }
        i <<- i + 1
      }
    }
    
    write_tsv(mtcars, "mtcars.tsv")
    read_tsv_chunked("mtcars.tsv", get_problem_chunk(3), chunk_size = 2, col_types = cols())
    #> NULL
    res
    #> # A tibble: 2 x 11
    #>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    #>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    #> 1  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
    #> 2  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
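
The example above stores the chunk in a variable called res in the global environment. If that is undesirable, one possible variation is to wrap the call in a helper so that <<- updates a variable local to the helper instead. This is only a sketch: read_nth_chunk() is a hypothetical name, not part of readr, and it relies on the behaviour described above, where a plain function is treated as a SideEffectCallback and returning FALSE stops the read.

```r
library(readr)

# Hypothetical helper (not part of readr): read only the `num`-th chunk
# of a TSV file and return it, stopping as soon as that chunk is seen.
read_nth_chunk <- function(file, num, chunk_size, ...) {
  i <- 0      # number of chunks seen so far
  res <- NULL # stays NULL if the file has fewer than `num` chunks

  callback <- function(x, pos) {
    i <<- i + 1
    if (i == num) {
      res <<- x     # updates `res` inside read_nth_chunk(), not a global
      return(FALSE) # FALSE asks the SideEffectCallback to stop reading
    }
    TRUE            # any other value lets the read continue
  }

  # A plain function is wrapped in a SideEffectCallback by default.
  read_tsv_chunked(file, callback, chunk_size = chunk_size, ...)
  res
}

# Usage, mirroring the example above but writing to a temporary file.
path <- tempfile(fileext = ".tsv")
write_tsv(mtcars, path)
chunk <- read_nth_chunk(path, num = 3, chunk_size = 2, col_types = cols())
```

Here chunk should hold the same two rows as res in the answer above, and nothing is left behind in the global environment apart from the helper itself.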