I have been trying for the last three hours to read a CSV file from Azure Data Lake Storage Gen2 (ADLS Gen2) into a pandas DataFrame. This is straightforward in Azure Blob Storage (ABS), but I can't figure out how to do it in ADLS Gen2.
This is the function I have developed so far:
def read_csv_from_adls_to_df(storage_account_name, storage_account_key, container_name, directory_name, file_name):
    service_client = DataLakeServiceClient(
        account_url=f"https://{storage_account_name}.dfs.core.windows.net",
        credential=storage_account_key)
    file_system_client = service_client.get_file_system_client(file_system=container_name)
    directory_client = file_system_client.get_directory_client(directory_name)
    # get_file_client (not create_file, which would create/overwrite the file)
    file_client = directory_client.get_file_client(file_name)
    file_download = file_client.download_file()
    return None
I am not able to figure out what I should do after the file_download step. I have tried several things, like readall() and readinto(), but nothing seems to work.
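For reference, this is the kind of pattern I expect to need (a sketch only; the raw bytes below are a hypothetical stand-in for whatever file_client.download_file().readall() would return):

```python
import io

import pandas as pd

# Hypothetical stand-in for the bytes returned by
# file_client.download_file().readall()
raw_bytes = b"col_a,col_b\n1,2\n3,4\n"

# Wrap the raw bytes in a file-like object so pandas can parse them
df = pd.read_csv(io.BytesIO(raw_bytes))
print(df.shape)  # (2, 2)
```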
For comparison, here is the function I always use for reading a CSV from ABS into a DataFrame:
def read_csv_from_blob(blob_service_client, container_name, blob_name):
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
    # Retrieve the blob file
    blob_download = blob_client.download_blob()
    # Read the blob file into a DataFrame
    blob_data = io.StringIO(blob_download.content_as_text())
    df = pd.read_csv(blob_data)
    return df
PS: I am not doing this on Databricks; I am doing this in plain Python.
I hope this documentation can help: https://learn.microsoft.com/en-us/azure/architecture/data-science-process/explore-data-blob
Reading a file from S3 using a Lambda is simple, but Azure makes this simple task complicated. Alternatively, you can use Azure Data Factory.