I am open to other ways of doing this. Here are my constraints, and what I am able to do so far:
I can use `arrow::open_dataset()` to connect to a local parquet directory:

```r
ds <- arrow::open_dataset(filepath, partitioning = "product")
```
I can also connect to a blob container with the `AzureStor` package, download a single parquet file, and turn it into a data frame:

```r
blob <- AzureStor::storage_endpoint("{URL}", key = "{KEY}")
cont <- AzureStor::storage_container(blob, "{CONTAINER-NAME}")

# dest = NULL downloads the file into memory as a raw vector,
# which arrow::read_parquet() can read directly
parq <- AzureStor::storage_download(cont, src = "{FILE-PATH}", dest = NULL)
df <- arrow::read_parquet(parq)
```
What I haven't been able to figure out is how to use `arrow::open_dataset()` to reference the parent directory of `{FILE-PATH}`, where all of the parquet files live, using the connection to the container that I'm creating with `AzureStor`. `arrow::open_dataset()` only accepts a character vector as its `sources` parameter, and if I just give it the URL with the path, I'm not passing any kind of credential to access the container.
Unfortunately, you probably are not going to be able to do this today purely from R.
Arrow-R is built on Arrow-C++, and Arrow-C++ does not yet have a filesystem implementation for Azure. There are JIRA tickets (ARROW-9611, ARROW-2034) for creating one, but they are not in progress at the moment.
In Python it is possible to create a filesystem purely in Python using the fsspec adapter. Since there is a Python SDK for Azure Blob Storage, it should be possible to do what you want in Python today.
Presumably something similar could be created for R, but you would still need to build the R equivalent of the fsspec adapter, and that would involve some C++ code.
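In the meantime, a pragmatic workaround entirely from R is to copy the parquet files down with AzureStor and point `arrow::open_dataset()` at the local copy. Here is a minimal sketch, assuming the files sit under a `{PARENT-DIR}` prefix in the container with the same `product` partitioning as in your question (the endpoint, key, and paths are placeholders):

```r
library(AzureStor)
library(arrow)

# Connect to the container as in the question
blob <- storage_endpoint("{URL}", key = "{KEY}")
cont <- storage_container(blob, "{CONTAINER-NAME}")

# Copy every parquet file under the parent directory to a local staging area;
# recursive = TRUE descends into the partition subdirectories
local_dir <- file.path(tempdir(), "parquet-data")
storage_multidownload(cont, src = "{PARENT-DIR}/*.parquet", dest = local_dir,
                      recursive = TRUE)

# Open the local copy as a partitioned dataset
ds <- open_dataset(local_dir, partitioning = "product")
```

This copies the data rather than reading it from blob storage on demand, so it only makes sense when the dataset fits on local disk, but it preserves the directory layout that `open_dataset()` expects.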