I am using the following code to get files in a subdirectory in a container:
from azure.storage.filedatalake import DataLakeServiceClient
remote_paths = service_client.get_file_system_client("mycontainer").get_paths(path="a/b/c")
The problem is that get_paths() returns all files and folders in all subdirectories of c, but I am only interested in the files directly in directory c.
I am aware of .is_directory, but filtering on it still leaves files from subdirectories in the result.
I could strip the prefix (a/b/c) from each result and then check for a remaining /, which would indicate the file is in a subfolder, but I am wondering whether there is a better way?
You are right: the get_paths() method in Azure's DataLakeServiceClient returns all files and folders under the specified path, recursively. Depending on your SDK version, get_paths may accept a recursive parameter (recursive=False limits the listing to the immediate children of the path), so it is worth checking whether yours supports it before filtering by hand.
However, you can still filter the results to only include files in the top-level directory by checking if the name of the path contains any additional slashes beyond the initial directory. Here’s an example of how you can do this:
from azure.storage.filedatalake import DataLakeServiceClient

def get_top_level_files(service_client, container_name, directory_path):
    file_system_client = service_client.get_file_system_client(container_name)
    paths = file_system_client.get_paths(path=directory_path)
    top_level_files = []
    for path in paths:
        # get_paths yields names like "a/b/c/file.txt", so the remainder
        # after the directory prefix starts with a slash; strip it before
        # testing, otherwise every file would be excluded.
        relative = path.name[len(directory_path):].lstrip('/')
        # Keep only files whose remainder contains no further slash,
        # i.e. entries sitting directly in the directory.
        if not path.is_directory and '/' not in relative:
            top_level_files.append(path)
    return top_level_files
Usage:
service_client = DataLakeServiceClient(...)
top_level_files = get_top_level_files(service_client, "mycontainer", "a/b/c")
In this code, path.name[len(directory_path):] takes the part of the path name after the specified directory, and the check excludes any entry whose remainder still contains a slash, i.e. anything inside a subfolder. Note that get_paths returns names such as a/b/c/file.txt, so the remainder begins with a slash; strip that leading slash before testing, or every file will be rejected.
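The filtering rule can be sanity-checked without an Azure account by running it against plain stand-in objects. This is a sketch: SimpleNamespace is a hypothetical stand-in for the SDK's path objects (which expose .name and .is_directory), and the names mimic the a/b/c layout from the question:

from types import SimpleNamespace

def filter_top_level(paths, directory_path):
    # Same filtering rule, applied to any iterable of objects
    # with .name and .is_directory attributes.
    top_level = []
    for path in paths:
        # Drop the directory prefix and its leading slash, then reject
        # anything whose remainder still contains a slash.
        relative = path.name[len(directory_path):].lstrip('/')
        if not path.is_directory and '/' not in relative:
            top_level.append(path)
    return top_level

# Stand-ins for what get_paths(path="a/b/c") might yield.
listing = [
    SimpleNamespace(name="a/b/c/file1.txt", is_directory=False),
    SimpleNamespace(name="a/b/c/sub", is_directory=True),
    SimpleNamespace(name="a/b/c/sub/file2.txt", is_directory=False),
]

print([p.name for p in filter_top_level(listing, "a/b/c")])
# → ['a/b/c/file1.txt']

Only file1.txt survives: the sub folder is excluded by .is_directory, and file2.txt is excluded because its remainder, sub/file2.txt, still contains a slash.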
Beyond checking whether your SDK version's get_paths can limit recursion server-side, I don't think there is an easier way to achieve this, but this method is straightforward and reliable.