python-3.x pandas azure-data-lake pyarrow azure-data-lake-gen2

Reading a parquet file from ADLS Gen2 using a service principal


I am using the azure-storage-file-datalake package to connect to ADLS Gen2:

from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

# service principal credential
tenant_id = 'xxxxxxx'
client_id = 'xxxxxxxxx'
client_secret = 'xxxxxxxx'
storage_account_name = 'xxxxxxxx'

credential = ClientSecretCredential(tenant_id, client_id, client_secret)

service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
        "https", storage_account_name), credential=credential) # I have also tried blob instead of dfs in account_url

The folder structure in ADLS Gen2 from which I have to read the parquet file looks like this. Inside the ADLS Gen2 container there is folder_a, which contains folder_b, which holds the parquet file.

folder_a
  |-folder_b
      parquet_file1
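
As a minimal sketch (not part of the original question), the folder structure above corresponds to a single path relative to the container, written with forward slashes. The helper name `build_adls_path` is hypothetical and only illustrates how such a path could be assembled.

```python
import posixpath


def build_adls_path(*parts: str) -> str:
    """Join path components with forward slashes, ignoring empty parts."""
    # posixpath always joins with '/', unlike os.path on Windows
    cleaned = [p.strip("/") for p in parts if p and p.strip("/")]
    return posixpath.join(*cleaned) if cleaned else ""


file_path = build_adls_path("folder_a", "folder_b", "parquet_file1")
print(file_path)  # folder_a/folder_b/parquet_file1
```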

With Gen1 storage we used to read the parquet file like this:

from azure.datalake.store import lib
from azure.datalake.store.core import AzureDLFileSystem
import pyarrow.parquet as pq

adls = lib.auth(tenant_id=directory_id,
            client_id=app_id,
            client_secret=app_key)
adl = AzureDLFileSystem(adls, store_name=adls_name) 

f = adl.open(file, 'rb')  # file is the path to the parquet file, e.g. 'folder_a/folder_b/parquet_file1'
table = pq.read_table(f)

How do we proceed with Gen2 storage? We are stuck at this point.

We have been following this guide: http://peter-hoffmann.com/2020/azure-data-lake-storage-gen-2-with-python.html

Note: we are not using Databricks for this.


Solution

  • Please refer to the following code, which downloads the file into an in-memory buffer and reads it with pyarrow:

    from azure.identity import ClientSecretCredential
    from azure.storage.filedatalake import DataLakeServiceClient
    import pyarrow.parquet as pq
    import io
    
    client_id = ''
    client_secret = ''
    tenant_id = ''
    credential = ClientSecretCredential(tenant_id, client_id, client_secret)
    
    storage_account_name = 'testadls05'
    service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
        "https", storage_account_name), credential=credential)
    file_system = '<container name>'
    file_system_client = service_client.get_file_system_client(file_system)
    
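    # path of the file inside the container, e.g. 'folder_a/folder_b/parquet_file1'
    file_path = ''
    file_client = file_system_client.get_file_client(file_path)
    # download the whole file, starting at offset 0
    data = file_client.download_file(0)
    with io.BytesIO() as b:
        # copy the downloaded bytes into the in-memory buffer
        data.readinto(b)
        b.seek(0)  # rewind so the buffer is read from the start
        table = pq.read_table(b)
        print(table)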
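
The steps above can also be wrapped in a reusable function. This is a minimal sketch, assuming the service principal has been granted read access to the container and that pandas is installed for the final conversion. The function names `build_account_url` and `read_parquet_from_adls` are hypothetical, and `readall()` is used here in place of `readinto()` to get the file content as bytes.

```python
import io


def build_account_url(storage_account_name: str) -> str:
    """Return the DFS endpoint URL of an ADLS Gen2 storage account."""
    return "https://{}.dfs.core.windows.net".format(storage_account_name)


def read_parquet_from_adls(tenant_id, client_id, client_secret,
                           storage_account_name, file_system, file_path):
    """Download one parquet file from ADLS Gen2 and return a pandas DataFrame."""
    # Third-party imports are kept local so the URL helper works without them
    from azure.identity import ClientSecretCredential
    from azure.storage.filedatalake import DataLakeServiceClient
    import pyarrow.parquet as pq

    credential = ClientSecretCredential(tenant_id, client_id, client_secret)
    service_client = DataLakeServiceClient(
        account_url=build_account_url(storage_account_name),
        credential=credential,
    )
    file_system_client = service_client.get_file_system_client(file_system)
    file_client = file_system_client.get_file_client(file_path)

    # readall() returns the whole file content as bytes
    content = file_client.download_file().readall()
    table = pq.read_table(io.BytesIO(content))
    return table.to_pandas()
```

With the folder layout from the question, it could be called as `df = read_parquet_from_adls(tenant_id, client_id, client_secret, storage_account_name, '<container name>', 'folder_a/folder_b/parquet_file1')`.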
    
    
