Tags: r, azure-storage, parquet, apache-arrow

R arrow read_parquet: Call to R (seek() on R connection) from a non-R thread from an unsupported context


I am using the read_parquet() function from the R arrow package to read Parquet files. The job runs every night on several dozen files, but as of today it has been failing on some of them with the error message:

Call to R (seek() on R connection) from a non-R thread from an unsupported context

The error seems to be linked to file size: the first file to fail is also the first with more than 2^20 rows and 10 million cells. I have tried running the job with the files in a different order, and it still fails at the first "large" file. However, it ran fine yesterday on files much bigger than this. I don't believe the arrow package has been updated in that time, and even if it had, I don't think the agent running the job would have installed the update. I can run the same function on the same file from my local machine and it works, so the problem must lie with the agent running the job, but I don't know where to start!

Does anyone know what that error means and what could cause it? I believe it is being generated by this line in the arrow source on GitHub.

For further detail, I believe the exact point it is failing at is when it tries to run

arrow::read_parquet(file = con)

where con is a rawConnection object given by

con <- rawConnection(raw(0), "r+")

The data in con is downloaded from azure data lake storage using

AzureStor::download_adls_file(dest = con)

These functions are called dozens of times on dozens of different files each night, and they had succeeded on other files moments before the failure, so the cause must be something specific to this file which, as I said, is the first with more than 2^20 rows and 10 million cells.

EDIT: The answer below from paleolimbot has solved the problem, thank you!


Solution

  • I believe this occurs because of an oversight when we added support for reading/writing R connections for the various file types (https://github.com/apache/arrow/issues/36819). Very large files are read using multiple threads (as you noted), and that case wasn't covered by our tests.

    As a workaround, you should be able to wrap the raw vector in an arrow::buffer() instead of using a rawConnection():

    library(arrow, warn.conflicts = FALSE)
    
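    # Write a small Parquet file to stand in for the downloaded data
    tmp <- tempfile()
    write_parquet(mtcars, tmp)
    
    # Read the file's bytes into a raw vector, wrap them in an Arrow buffer,
    # and read the Parquet data from the buffer instead of an R connection
    content_raw <- readr::read_file_raw(tmp)
    content_buffer <- buffer(content_raw)
    read_parquet(content_buffer)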
    #> # A tibble: 32 × 11
    #>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    #>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    #>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
    #>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
    #>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
    #>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
    #>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
    #>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
    #>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
    #>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
    #>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
    #> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
    #> # ℹ 22 more rows
    

    Created on 2023-07-21 with reprex v2.0.2
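
Applied to the Azure download in the question, the same idea might look like the sketch below. It assumes that AzureStor::download_adls_file() returns the file contents as a raw vector when dest = NULL, which is worth checking against the AzureStor documentation for your version. The names `fs` and "path/to/file.parquet" are hypothetical placeholders, and read_parquet_raw() is a helper name introduced here purely for illustration. A local temporary file stands in for the download so the example runs without Azure credentials.

```r
library(arrow, warn.conflicts = FALSE)

# Hypothetical helper: parse Parquet bytes held in a raw vector.
# Wrapping the bytes in an Arrow buffer keeps R connections out of the read.
read_parquet_raw <- function(bytes) {
  stopifnot(is.raw(bytes))
  read_parquet(buffer(bytes))
}

# Stand-in for the download: write a small Parquet file and read its bytes.
tmp <- tempfile(fileext = ".parquet")
write_parquet(mtcars, tmp)
bytes <- readBin(tmp, what = "raw", n = file.size(tmp))

df <- read_parquet_raw(bytes)

# With AzureStor, the bytes would instead come from something like the
# following (assumption: dest = NULL returns a raw vector; `fs` and the
# path are placeholders):
#   bytes <- AzureStor::download_adls_file(fs, "path/to/file.parquet", dest = NULL)
#   df <- read_parquet_raw(bytes)
```

Because the raw vector is already fully in memory, this should not cost more memory than the rawConnection approach, and it avoids passing a connection to read_parquet() at all.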