python python-3.x azure-data-lake-gen2

Azure ADLS Gen2: List files in top level directory only?


I am using the following code to get files in a subdirectory in a container:

from azure.storage.filedatalake import DataLakeServiceClient
remote_paths = service_client.get_file_system_client("mycontainer").get_paths(path="a/b/c")

The problem is that get_paths() returns all files and folders in all subdirectories of c, but I am only interested in the files in directory c.

I am aware of .is_directory, but this still returns files in subdirectories.

I could remove the path (a/b/c) from the result set and then check for the existence of /, which would indicate the file being in a subfolder, but I am wondering whether there is a better way?


Solution

  • You are right: by default, get_paths() on the file system client returns all files and folders under the specified path, including those in subdirectories. The method does accept a recursive parameter, which defaults to True; passing recursive=False should limit the listing to the immediate children of the path.

    You can also filter the results on the client side so that only files in the top-level directory remain, by checking whether the name of each path contains any further slashes after the directory prefix. Here's an example of how you can do this:

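    from azure.storage.filedatalake import DataLakeServiceClient

    def get_top_level_files(service_client, container_name, directory_path):
        file_system_client = service_client.get_file_system_client(container_name)
        paths = file_system_client.get_paths(path=directory_path)

        # Normalize the prefix so it ends with exactly one slash, e.g. "a/b/c/"
        # (an empty directory_path means the container root and gets no slash)
        prefix = directory_path.strip('/')
        if prefix:
            prefix += '/'

        top_level_files = []

        for path in paths:
            # Keep only files whose name has no further slash after the prefix
            if not path.is_directory and '/' not in path.name[len(prefix):]:
                top_level_files.append(path)

        return top_level_files

    Usage:

    service_client = DataLakeServiceClient(...)
    top_level_files = get_top_level_files(service_client, "mycontainer", "a/b/c")

    In this code, prefix is the directory path with a single trailing slash, path.name[len(prefix):] is the part of the path name after that directory, and '/' not in path.name[len(prefix):] checks that this remainder contains no further slash, i.e. that the file sits directly in the directory rather than in a subfolder. The trailing slash matters: slicing at len(directory_path) alone would leave a leading '/' on every remainder, so no file would ever pass the check.

    This client-side filter is simple and reliable, although it still fetches every nested path from the service before discarding most of them.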
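For comparison, get_paths() also takes a recursive argument (default True), so the listing can be restricted on the service side instead. The sketch below is a minimal illustration, not a tested production helper: list_immediate_files is a hypothetical name, and it assumes file_system_client is the object returned by service_client.get_file_system_client(...). Because a non-recursive listing still includes the immediate subdirectories themselves, the is_directory check is kept.

```python
def list_immediate_files(file_system_client, directory_path):
    """Return only the files that sit directly in directory_path.

    Hypothetical helper: file_system_client is assumed to be an
    azure.storage.filedatalake FileSystemClient (or anything exposing
    a compatible get_paths method).
    """
    # recursive=False asks the service for immediate children only
    paths = file_system_client.get_paths(path=directory_path, recursive=False)

    # Immediate subdirectories are still listed, so drop them here
    return [p for p in paths if not p.is_directory]


# Usage (assumes an already configured service_client):
# fs_client = service_client.get_file_system_client("mycontainer")
# files = list_immediate_files(fs_client, "a/b/c")
```

This avoids transferring the names of every nested file only to throw them away, which may matter for deep directory trees.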