Search code examples
razure-blob-storageazure-storageparquetapache-arrow

How to connect to parquet files in Azure Blob Storage with arrow::open_dataset?


I am open to other ways of doing this. Here are my constraints:

  • I have parquet files in a container in Azure Blob Storage
  • These parquet files will be partitioned by a product id, as well as the date (year/month/day)
  • I am doing this in R, and want to be able to connect interactively (not just set up a notebook in databricks, though that is something I will probably want to figure out later)

Here's what I am able to do:

  • I understand how to use arrow::open_dataset() to connect to a local parquet directory: ds <- arrow::open_dataset(filepath, partitioning = "product")
  • I can connect to, view, and download from my blob container with the AzureStor package. I can download a single parquet file this way and turn it into a data frame:
blob <- AzureStor::storage_endpoint("{URL}", key="{KEY}")
cont <- AzureStor::storage_container(blob, "{CONTAINER-NAME}")
parq <- AzureStor::storage_download(cont, src = "{FILE-PATH}", dest = NULL)
df <- arrow::read_parquet(parq)

What I haven't been able to figure out is how to use arrow::open_dataset() to reference the parent directory of {FILE-PATH}, where I have all the parquet files, using the connection to the container that I'm creating with AzureStor. arrow::open_dataset() only accepts a character vector as the "sources" parameter. If I just give it the URL with the path, I'm not passing any kind of credential to access the container.


Solution

  • Unfortunately, you probably are not going to be able to do this today purely from R.

    Arrow-R is based on Arrow-C++ and Arrow-C++ does not yet have a filesystem implementation for Azure. There are JIRA tickets ARROW-9611,ARROW-2034 for creating one but these tickets are not in progress at the moment.

    In python it is possible to create a filesystem purely in python using the FSspec adapter. Since there is a python SDK for Azure Blob Storage it should be possible to do what you want today in python.

    Presumably something similar could be created for R but you would still need to create the R equivalent of the fsspec adapter and that would involve some C++ code.