I am using the Python version of the polars library (https://github.com/pola-rs/polars) to read a parquet file with a large number of rows.
I am trying to read the file from an Azure storage account using the read_parquet method. It has a storage_options argument that can be used to specify how to connect to the data storage. Here is the definition of the read_parquet method:
def read_parquet(
    source: str | Path | BinaryIO | BytesIO | bytes,
    columns: list[int] | list[str] | None = None,
    n_rows: int | None = None,
    use_pyarrow: bool = False,
    memory_map: bool = True,
    storage_options: dict[str, object] | None = None,
    parallel: ParallelStrategy = "auto",
    row_count_name: str | None = None,
    row_count_offset: int = 0,
    low_memory: bool = False,
    pyarrow_options: dict[str, object] | None = None,
) -> DataFrame:
Can anyone tell me what values I need to provide in storage_options to connect to the Azure storage account when using a system-assigned managed identity? Unfortunately, I could not find any example of this. Most examples use connection strings or access keys, which I cannot use for security reasons.
Edit: I just learned that storage_options is passed on to another library called fsspec, but I am not familiar with it.
I finally figured out the solution. Anyone who wants to use a managed identity to connect to an Azure Data Lake Storage Gen2 account can follow the steps below. As someone mentioned in the comments, polars uses the fsspec and adlfs Python libraries to access remote files in Azure. To connect using a managed identity, we can use the code below:
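import polars as pl
# anon=False tells adlfs not to attempt anonymous access
storage_options = {'account_name': ACCOUNT_NAME, 'anon': False}
# the first parameter of read_parquet is named source (see the signature above)
df = pl.read_parquet(source=<remote-file-path>, columns=<list of columns>, storage_options=storage_options)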
This will try to use DefaultAzureCredential from the azure.identity library to connect to the storage account. If managed identity is already enabled for your Azure resource with the proper RBAC permissions, you should be able to connect.
Documentation: https://github.com/fsspec/adlfs#setting-credentials
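For anyone who wants something closer to copy-paste, here is a minimal sketch of the same approach wrapped in small helpers. It assumes a polars version that hands storage_options to fsspec/adlfs as described above, and that adlfs and azure-identity are installed. The helper names (build_storage_options, abfs_url, read_remote_parquet) and the account, container, and file names in the usage note are hypothetical and only there for illustration; the abfs://<container>/<path> URL form is the one adlfs documents.

```python
from typing import Optional


def build_storage_options(account_name: str) -> dict:
    """Build the options dict that is forwarded to adlfs.

    No key, connection string, or SAS token is included. With anon=False,
    adlfs does not attempt anonymous access and is expected to fall back to
    DefaultAzureCredential, which can pick up a managed identity.
    """
    return {"account_name": account_name, "anon": False}


def abfs_url(container: str, blob_path: str) -> str:
    """Compose an abfs:// URL from a container name and a path inside it."""
    return f"abfs://{container}/{blob_path.lstrip('/')}"


def read_remote_parquet(
    account_name: str,
    container: str,
    blob_path: str,
    columns: Optional[list] = None,
):
    """Read a remote parquet file using the managed identity of the host."""
    # Imported lazily so the helpers above can be used without polars installed.
    import polars as pl

    return pl.read_parquet(
        abfs_url(container, blob_path),
        columns=columns,
        storage_options=build_storage_options(account_name),
    )
```

On an Azure resource with managed identity enabled, a call would look like read_remote_parquet("mystorageaccount", "mycontainer", "folder/data.parquet", columns=["col_a"]). Keeping the options in one helper makes it easy to confirm that no secret ever ends up in the dict.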