I am using the R arrow package's read_parquet() function to read a parquet file. This function runs every night on several dozen files, but as of today it has been failing for some of them with the error message:
Call to R (seek() on R connection) from a non-R thread from an unsupported context
The error seems to be linked to the size of the file: the first file to fail is also the first with more than 2^20 rows and 10 million cells. I have tried running the job with the files in a different order and it still fails at the first "large" file. However, it ran fine yesterday on files which are much bigger than this. I don't believe the arrow package has been updated in that time, and even if it had, I don't think the agent running the job would have installed the update. I can run the same function on the same file from my local machine and it works fine, so it must be something to do with the agent running the job, but I don't know where to start!
Does anyone know what that error means and what could cause it? I believe it is being generated by this line in the arrow source on GitHub.
For further detail, I believe the exact point at which it fails is when it tries to run arrow::read_parquet(file = con), where con is a rawConnection object created by con <- rawConnection(raw(0), "r+"). The data in con is downloaded from Azure Data Lake Storage using AzureStor::download_adls_file(dest = con). These functions are used dozens of times on dozens of different files each night, and they worked successfully moments before on different files, so there must be something specific to this particular file, which, as I say, is the first to be over 2^20 rows and 10 million cells.
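For context, the nightly job does something roughly along these lines; the filesystem endpoint and file path below are placeholders rather than the real ones, and authentication arguments are omitted:
library(arrow)
library(AzureStor)

# Placeholder endpoint; the real job points at our ADLS Gen2 filesystem
fs <- adls_filesystem("https://myaccount.dfs.core.windows.net/mycontainer")

# Download the parquet bytes into an in-memory read/write raw connection
con <- rawConnection(raw(0), "r+")
download_adls_file(fs, src = "path/to/file.parquet", dest = con)
seek(con, 0)  # rewind before reading (the real job may handle this differently)

# This is the call that fails for the large file with the seek() error above
df <- arrow::read_parquet(file = con)
close(con)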
EDIT: The answer below from paleolimbot has solved the problem, thank you!
I believe this occurs because of an oversight when we added support for reading/writing R connections for the various file types (https://github.com/apache/arrow/issues/36819). Very large files are read using multiple threads (as you noted), which weren't part of our tests.
As a workaround, you should be able to wrap the raw vector in an arrow::buffer() instead of using a rawConnection():
library(arrow, warn.conflicts = FALSE)

# Write an example parquet file, then read its bytes back in as a raw vector
tmp <- tempfile()
write_parquet(mtcars, tmp)
content_raw <- readr::read_file_raw(tmp)

# Wrap the raw vector in an arrow buffer and read the parquet data directly from it
content_buffer <- buffer(content_raw)
read_parquet(content_buffer)
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows
Created on 2023-07-21 with reprex v2.0.2
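Following up on the accepted answer: applying the buffer() workaround to the ADLS workflow is straightforward. If I'm reading the AzureStor documentation correctly, passing dest = NULL to download_adls_file() returns the downloaded contents as a raw vector, which can then be wrapped in buffer() so that no R connection is involved at all. A rough sketch, again with a placeholder endpoint and path:
library(arrow)
library(AzureStor)

fs <- adls_filesystem("https://myaccount.dfs.core.windows.net/mycontainer")  # placeholder endpoint

# dest = NULL should return the file contents as a raw vector rather than writing to a connection
content_raw <- download_adls_file(fs, src = "path/to/file.parquet", dest = NULL)  # placeholder path

# Wrap the raw bytes in an arrow buffer; the multi-threaded reader then never
# needs to call seek() back into R, which is what triggered the error
df <- read_parquet(buffer(content_raw))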