Tags: python, pandas, azure, azure-data-lake-gen2

Read CSV from Azure Data Lake Storage Gen2 into a pandas DataFrame | NO DATABRICKS


I have been trying for the last 3 hours to read a CSV from Azure Data Lake Storage Gen2 (ADLS Gen2) into a pandas DataFrame. This is very easy in Azure Blob Storage (ABS), but I can't figure out how to do it in ADLS Gen2.

This is the function I have developed so far:

from azure.storage.filedatalake import DataLakeServiceClient

def read_csv_from_adls_to_df(storage_account_name, storage_account_key, container_name, directory_name, file_name):
    service_client = DataLakeServiceClient(account_url=f"https://{storage_account_name}.dfs.core.windows.net", credential=storage_account_key)
    file_system_client = service_client.get_file_system_client(file_system=container_name)
    directory_client = file_system_client.get_directory_client(directory_name)
    file_client = directory_client.create_file(file_name)
    file_download = file_client.download_file()

    return None

I am not able to figure out what I should do after the file_download step. I have tried several things, like readall() and readinto(), but nothing seems to work.

The following is the function I always use for reading a CSV from ABS into a DataFrame:

import io

import pandas as pd

def read_csv_from_blob(blob_service_client, container_name, blob_name):
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

    # Retrieve extract blob file
    blob_download = blob_client.download_blob()

    # Read blob file into DataFrame
    blob_data = io.StringIO(blob_download.content_as_text())
    df = pd.read_csv(blob_data)
    return df

PS: I am not doing this on Databricks; I am doing this in plain Python.


Solution

  • I hope that this documentation can help: https://learn.microsoft.com/en-us/azure/architecture/data-science-process/explore-data-blob

    Reading a file from S3 with a Lambda function is simple, but Azure makes this simple task complicated. Alternatively, you can use Azure Data Factory.
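
For the specific function in the question, here is a minimal sketch of how it could be completed with the azure-storage-file-datalake SDK (no Databricks needed). It uses get_file_client rather than create_file, since create_file creates a new (empty) file at that path instead of referencing the existing one, and it feeds the downloaded bytes to pandas via io.BytesIO. The account, container, directory, and file names in the usage line are placeholders.

import io

import pandas as pd
from azure.storage.filedatalake import DataLakeServiceClient

def read_csv_from_adls_to_df(storage_account_name, storage_account_key, container_name, directory_name, file_name):
    service_client = DataLakeServiceClient(
        account_url=f"https://{storage_account_name}.dfs.core.windows.net",
        credential=storage_account_key,
    )
    file_system_client = service_client.get_file_system_client(file_system=container_name)
    directory_client = file_system_client.get_directory_client(directory_name)

    # Reference the existing file; create_file() would create/reset it instead
    file_client = directory_client.get_file_client(file_name)

    # Download the file and read its full contents as bytes
    file_download = file_client.download_file()
    csv_bytes = file_download.readall()

    # Parse the CSV bytes into a DataFrame
    return pd.read_csv(io.BytesIO(csv_bytes))

# Example usage (placeholder names):
df = read_csv_from_adls_to_df("mystorageaccount", "<account-key>", "my-container", "my-directory", "data.csv")

Note that readall() returns the whole file as bytes, so this loads the entire CSV into memory; for very large files you may prefer readinto() with a file opened on disk and then read that file with pandas.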