Tags: python, azure, parquet, azure-data-lake-gen2

How can I upload a .parquet file from my local machine to Azure Storage Data Lake Gen2?


I have a set of .parquet files in my local machine that I am trying to upload to a container in Data Lake Gen2.

I cannot get the following to work:

def upload_file_to_directory():
    try:
        file_system_client = service_client.get_file_system_client(file_system="my-file-system")

        directory_client = file_system_client.get_directory_client("my-directory")

        file_client = directory_client.create_file("uploaded-file.parquet")

        # This is where it fails: .parquet is a binary format, so opening the
        # file in text mode ('r') makes .read() attempt a text decode, which
        # typically raises a UnicodeDecodeError
        local_file = open("C:\\file-to-upload.parquet", 'r')
        file_contents = local_file.read()

        file_client.append_data(data=file_contents, offset=0, length=len(file_contents))

        file_client.flush_data(len(file_contents))

    except Exception as e:
        print(e)

because a .parquet file is binary, so it cannot be read by .read() in text mode.
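For reference, opening the same file in binary mode does return raw bytes (a minimal check, using the example path from above):

# Binary mode ('rb') skips text decoding and yields bytes
with open("C:\\file-to-upload.parquet", "rb") as f:
    raw = f.read()
print(type(raw))  # <class 'bytes'>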

When I try to do this:

def upload_file_to_directory():

    file_system_client = service_client.get_file_system_client(file_system="my-file-system")

    directory_client = file_system_client.get_directory_client("my-directory")

    file_client = directory_client.create_file("uploaded-file.parquet")
    file_client.upload_file("C:\\file-to-upload.parquet", 'r')


I get the following error:

AttributeError: 'DataLakeFileClient' object has no attribute 'upload_file'

Any suggestions?


Solution

  • You are getting the AttributeError because DataLakeFileClient has no upload_file method; the method is called upload_data, and it is only available in recent versions of the azure-storage-file-datalake package. Make sure the package is installed and up to date:

    pip install --upgrade azure-storage-file-datalake
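
    As a quick sanity check (a minimal sketch, no storage account needed), you can list the upload-related methods the client actually exposes:

    from azure.storage.filedatalake import DataLakeFileClient
    # Expect 'upload_data' in the output; 'upload_file' will not be there
    print([m for m in dir(DataLakeFileClient) if "upload" in m])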
    

    To read the .parquet file, one workaround is to use pandas. Below is the code that worked for me.

    storage_account_name = '<ACCOUNT_NAME>'
    storage_account_key = '<ACCOUNT_KEY>'

    service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
                "https", storage_account_name), credential=storage_account_key)

    file_system_client = service_client.get_file_system_client(file_system="container")

    directory_client = file_system_client.get_directory_client(directory="directory")

    file_client = directory_client.create_file("uploaded-file.parquet")

    # pd.read_parquet already returns a DataFrame; calling to_parquet() with no
    # path re-serializes it and returns the parquet file contents as bytes
    df = pd.read_parquet("<YOUR_FILE_NAME>.parquet")
    data = df.to_parquet()

    file_client.upload_data(data=data, overwrite=True)  # either of these two lines works
    #file_client.append_data(data=data, offset=0, length=len(data))
    file_client.flush_data(len(data))  # flush commits the data; required after append_data
    

    and you will need the following imports for this to work:

    from azure.storage.filedatalake import DataLakeServiceClient
    import pandas as pd
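
    Alternatively (a minimal sketch reusing the directory_client from above), you can skip the pandas round-trip entirely and upload the raw bytes, since upload_data accepts bytes directly:

    # Read the local parquet file as raw bytes and upload it in one call;
    # overwrite=True creates or replaces the remote file
    file_client = directory_client.create_file("uploaded-file.parquet")
    with open("C:\\file-to-upload.parquet", "rb") as f:
        file_client.upload_data(f.read(), overwrite=True)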
    

    RESULTS:

    [Screenshot: the uploaded parquet file shown in the container]