I'm trying to use read_tsv_chunked() on a large .tsv file, and would like to stop after a certain number of chunks.
@jimhester has suggested a useful approach for interactively viewing a given chunk with browser(): https://github.com/tidyverse/readr/issues/848#issuecomment-388234659, but I'd like to write a function that 1) returns just the chunk of interest; and 2) stops reading the file after returning that chunk.
I've modified Jim's response to return the chunk so that I can use it with a DataFrameCallback, but can't figure out how to stop the read from within read_tsv_chunked().
My approach so far:
library(readr)

get_problem_chunk <- function(num) {
  i <- 1  # counter of chunks seen so far, kept in the closure
  function(x, pos) {
    if (i == num) {
      i <<- i + 1
      return(x)
    }
    i <<- i + 1
    message(pos)  # to see that it's scanning the whole file
    return(NULL)  # break() or error() cause errors
  }
}

write_tsv(mtcars, "mtcars.tsv")
read_tsv_chunked("mtcars.tsv", DataFrameCallback$new(get_problem_chunk(3)), chunk_size = 3)
As you can see, that returns the chunk I want, but it doesn't stop reading until the file is exhausted and the callback has no more chunks to receive:
> read_tsv_chunked("mtcars.tsv", DataFrameCallback$new(get_problem_chunk(3)), chunk_size = 3)
Parsed with column specification:
cols(
  mpg = col_double(),
  cyl = col_integer(),
  disp = col_integer(),
  hp = col_integer(),
  drat = col_double(),
  wt = col_double(),
  qsec = col_double(),
  vs = col_integer(),
  am = col_integer(),
  gear = col_integer(),
  carb = col_integer()
)
1
4
<I WANT IT TO STOP HERE, BUT DON'T KNOW HOW>
10
13
16
19
22
25
28
31
# A tibble: 3 x 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <int> <int> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
1  14.3     8   360   245  3.21  3.57  15.8     0     0     3     4
2  24.4     4    NA    62  3.69  3.19  20       1     0     4     2
3  22.8     4    NA    95  3.92  3.15  22.9     1     0     4     2
@jimhester to the rescue again - https://github.com/tidyverse/readr/issues/851#issuecomment-388929640
You can do this by using the SideEffectCallback (which is the default when passed a normal function) and returning the results using the <<- operator. The SideEffectCallback stops reading when the callback function returns FALSE. e.g.
library(readr)

get_problem_chunk <- function(num) {
  i <- 1
  function(x, pos) {
    if (i == num) {
      res <<- x
      return(FALSE)
    }
    i <<- i + 1
  }
}

write_tsv(mtcars, "mtcars.tsv")
read_tsv_chunked("mtcars.tsv", get_problem_chunk(3), chunk_size = 2, col_types = cols())
#> NULL

res
#> # A tibble: 2 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
#> 2  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
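
If you'd rather not have the result land in the global environment, one possible variation is to keep both the counter and the result inside a wrapper function. This is only a sketch and not part of Jim's comment: read_nth_chunk is a hypothetical helper name, and it relies on the same behaviour described above, namely that a SideEffectCallback stops the read as soon as the callback returns FALSE.

```r
library(readr)

# Hypothetical helper (not part of readr): returns the n-th chunk of a .tsv
# file and stops reading as soon as that chunk has been seen.
read_nth_chunk <- function(file, n, chunk_size = 2) {
  i <- 1       # index of the chunk currently being processed
  res <- NULL  # stays NULL if the file has fewer than n chunks

  callback <- function(x, pos) {
    if (i == n) {
      res <<- x      # assigns to `res` in read_nth_chunk(), not globally
      return(FALSE)  # FALSE tells SideEffectCallback to stop reading
    }
    i <<- i + 1
    TRUE             # any value other than FALSE keeps reading
  }

  read_tsv_chunked(
    file,
    SideEffectCallback$new(callback),
    chunk_size = chunk_size,
    col_types = cols()
  )
  res
}

write_tsv(mtcars, "mtcars.tsv")
chunk <- read_nth_chunk("mtcars.tsv", 3)
```

Because res is defined inside read_nth_chunk(), the <<- assignment in the callback updates that local variable instead of creating a global one, so the chunk comes back as an ordinary return value. If n is larger than the number of chunks in the file, the whole file is read and the function returns NULL.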