Tags: python, azure-storage, python-polars, fsspec

Writing in Delta using Polars and adlfs


According to "How do you write polars data frames to Azure blob storage?", we can write Parquet with Polars directly to Azure Storage, e.g. to a basic blob container.

In my case I was required to write in Delta format, which sits on top of Parquet, so I modified the code a bit, since Polars also supports Delta:

import adlfs
import polars as pl

# adlfs runs on the async Azure SDK, hence the aio credential
from azure.identity.aio import DefaultAzureCredential

# pdf: pl.DataFrame
# path: str
# account_name: str
# container_name: str

credential = DefaultAzureCredential()
fs = adlfs.AzureBlobFileSystem(account_name=account_name, credential=credential)

with fs.open(f"{container_name}/way/to/{path}", mode="wb") as f:
    if path.endswith(".parquet"):
        # a single Parquet file: writing through the file handle works
        pdf.write_parquet(f)
    else:
        # a Delta table: this is the branch that fails (see below)
        pdf.write_delta(f, mode="append")

Using this code, I was able to write to Azure storage when I specified path = path/to/1.parquet, but not path = path/to/delta_folder/.

In the second case, the problem was that only a 0-byte file was written to delta_folder on Azure storage, since f is a single file handle.

What's more, if I target the local filesystem with pdf.write_delta(path, mode="append"), it just works.
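
For comparison, a minimal local sketch (the ./delta_folder path is just a placeholder) showing that write_delta creates the directory and its contents itself when given a plain path:

import polars as pl

pdf = pl.DataFrame({"a": [1, 2, 3]})

# on a local path, Polars (via delta-rs) creates delta_folder/ itself,
# including the _delta_log/ transaction log and the Parquet data files
pdf.write_delta("./delta_folder", mode="append")

print(pl.read_delta("./delta_folder"))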

How can I modify my code to support writing the whole delta_folder/ directory tree in the cloud?


Solution

  • The issue is that Delta writes (potentially) multiple files into a folder, so fsspec's model of opening one file at a time isn't going to work.

    You'll need to do something like:

    credential = DefaultAzureCredential()
    fs = adlfs.AzureBlobFileSystem(account_name=account_name, credential=credential)
    credentials_dict = {}  # object_store syntax; see the link below

    if path.endswith(".parquet"):
        # a single file, so the fsspec file handle is fine
        with fs.open(f"{container_name}/way/to/{path}", mode="wb") as f:
            pdf.write_parquet(f)
    else:
        # a directory of files: pass a URL plus storage_options and let
        # delta-rs manage the filesystem itself
        pdf.write_delta(
            f"abfs://{container_name}/way/to/",
            mode="append",
            storage_options=credentials_dict,
        )
    

    See the object_store Azure configuration documentation for the key fields that are compatible with credentials_dict.
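
    As a sketch, assuming account-key, SAS, or service-principal auth (all values below are placeholders), credentials_dict might look like this, and you can read the table back to verify the write:

    # storage_options keys understood by object_store / delta-rs for Azure;
    # keep only the entries that match your auth method (values are placeholders)
    credentials_dict = {
        "account_name": account_name,
        # either a storage account key ...
        "account_key": "<storage-account-key>",
        # ... or a SAS token
        # "sas_token": "<sas-token>",
        # ... or a service principal
        # "tenant_id": "<tenant-id>",
        # "client_id": "<client-id>",
        # "client_secret": "<client-secret>",
    }

    # read the table back to confirm the folder was written
    print(
        pl.read_delta(
            f"abfs://{container_name}/way/to/",
            storage_options=credentials_dict,
        )
    )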