Tags: python, azure-storage, azure-data-lake, azure-sdk-python, azure-data-lake-gen2

Azure DataLakeServiceClient Python - How to append, How to set Offset and Flush Length?



I want to create a CSV file and repeatedly append to it using DataLakeServiceClient (azure.storage.filedatalake package). The initial create/write works as follows.
from azure.storage.filedatalake import DataLakeServiceClient 

datalake_service_client = DataLakeServiceClient.from_connection_string(connect_str)
myfilesystem = "ContainerName"
myfolder     = "FolderName"
myfile       = "FileName.csv"

file_system_client = datalake_service_client.get_file_system_client(myfilesystem)
try:                    
    directory_client = file_system_client.create_directory(myfolder)         
except Exception as e:
    directory_client = file_system_client.get_directory_client(myfolder)
file_client = directory_client.create_file(myfile)
data = """Test1"""   
file_client.append_data(data, offset=0, length=len(data))
file_client.flush_data(len(data))


Suppose the next append is data = """Test2""". How should I set the offset for append_data and the length for flush_data?

Thanks.


Solution

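  • First, you are calling directory_client.create_file(myfile) on every run, which creates a new file each time and replaces the existing one. So your code never appends anything.

    Second, check whether the file already exists. If it does, get a client for it with get_file_client and append at the current file size. If it does not, create it with create_file and start at offset 0. In both cases, pass flush_data the total length of the file after the append, that is, the previous size plus the length of the new data. The complete code is below. (On my side, I am using a .txt file to test.)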

    from azure.core.exceptions import ResourceNotFoundError
    from azure.storage.filedatalake import DataLakeServiceClient

    connect_str = "DefaultEndpointsProtocol=https;AccountName=0730bowmanwindow;AccountKey=xxxxxx;EndpointSuffix=core.windows.net"
    datalake_service_client = DataLakeServiceClient.from_connection_string(connect_str)
    myfilesystem = "test"
    myfolder     = "test"
    myfile       = "FileName.txt"

    file_system_client = datalake_service_client.get_file_system_client(myfilesystem)
    # create_directory returns a client for the directory
    directory_client = file_system_client.create_directory(myfolder)
    file_client = directory_client.get_file_client(myfile)

    # Offsets and lengths are counted in bytes, so encode the text first
    data = "Test2".encode("utf-8")

    try:
        # The file exists: append at its current end
        filesize_previous = file_client.get_file_properties().size
    except ResourceNotFoundError:
        # The file does not exist yet: create it and start at offset 0
        file_client = directory_client.create_file(myfile)
        filesize_previous = 0

    print("length of data is " + str(len(data)))
    print("length of current file is " + str(filesize_previous))
    # Upload the new bytes at the end of the file, then commit them by
    # flushing at the new total length
    file_client.append_data(data, offset=filesize_previous, length=len(data))
    file_client.flush_data(filesize_previous + len(data))
    

    This works on my side; please try it on yours. (The above is just an example; you can make it cleaner and more streamlined.)
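
The offset arithmetic can also be pulled into a small helper. The following is only a sketch: `append_text` is a hypothetical name, and it assumes the file already exists and that the object passed in exposes `get_file_properties()`, `append_data()` and `flush_data()` the way the file client in the answer does. `FakeFileClient` is an in-memory stand-in invented here so the example runs without a storage account; it is not part of the Azure SDK, and the real service may report errors differently.

```python
from types import SimpleNamespace


def append_text(file_client, text: str) -> int:
    """Append text at the current end of an existing file and commit it.

    Returns the new total file size in bytes.
    """
    data = text.encode("utf-8")  # offsets and lengths are byte counts
    current_size = file_client.get_file_properties().size
    # Stage the new bytes at the current end of the file
    file_client.append_data(data, offset=current_size, length=len(data))
    new_size = current_size + len(data)
    # Commit by flushing at the total length after the append
    file_client.flush_data(new_size)
    return new_size


class FakeFileClient:
    """Hypothetical in-memory stand-in for the three calls used above."""

    def __init__(self):
        self.committed = b""  # data visible after a flush
        self._staged = b""    # data appended but not yet flushed

    def get_file_properties(self):
        return SimpleNamespace(size=len(self.committed))

    def append_data(self, data, offset, length=None):
        if offset != len(self.committed) + len(self._staged):
            raise ValueError("offset must equal the current end of the data")
        self._staged += data[:length]

    def flush_data(self, offset):
        if offset != len(self.committed) + len(self._staged):
            raise ValueError("flush position must equal the total length")
        self.committed += self._staged
        self._staged = b""


client = FakeFileClient()
append_text(client, "Test1")  # offset 0, flush at 5
append_text(client, "Test2")  # offset 5, flush at 10
print(client.committed)       # b'Test1Test2'
```

With a real file client, the same helper would be called as `append_text(file_client, "Test2")` once the file has been created. Encoding before measuring matters for non-ASCII text, where the number of characters and the number of bytes differ.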