Tags: databricks, azure-databricks, databricks-sql

Move files from one folder to another in Azure Data Lake without mounting the ADLS to DBFS


I have a situation where I receive files in Folder1, and after performing certain transformations I want to move those files to Folder2.

Reason for moving the data from Folder1 to Folder2: I do not want to keep the older files, since they would be duplicates for processing and would require unnecessary filtering.

I have tried to implement this with both Azure Databricks and Azure Data Factory.

1. Using Azure Databricks - There are a couple of ways to perform this activity:

a) dbutils functions, which allow you to "copy" and then "remove" the files:

    dbutils.fs.cp("abfss://abc@storage_account.dfs.core.windows.net/Folder1/", "abfss://abc@storage_account.dfs.core.windows.net/Folder2/", recurse=True)
    dbutils.fs.rm("abfss://abc@storage_account.dfs.core.windows.net/Folder1/", recurse=True)

b) Importing the DataLakeServiceClient library and moving the files with it (see the sketch after this list). But it requires the storage account key as a credential, which should be avoided.

2. Using Azure Data Factory - There is a Copy activity, but no built-in mechanism to move files. The workaround is two steps: a) copy all the files from Folder1 to Folder2 with a Copy activity, then b) delete all those files from Folder1 with a Delete activity.
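
For completeness, a minimal sketch of approach 1(b), assuming the azure-storage-file-datalake package; <storage>, <account-key>, <container> and the file path are placeholders, and the account key passed as the credential is exactly what this approach forces you to handle:

    from azure.storage.filedatalake import DataLakeServiceClient

    # <storage>, <account-key> and <container> are placeholders
    service = DataLakeServiceClient(
        account_url="https://<storage>.dfs.core.windows.net",
        credential="<account-key>",  # the account key this approach requires
    )
    fs = service.get_file_system_client("<container>")

    # rename_file moves the file server-side; the new name must be
    # prefixed with the file system (container) name
    file_client = fs.get_file_client("Folder1/file.csv")
    file_client.rename_file(f"{fs.file_system_name}/Folder2/file.csv")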


Solution

  • I tried a lot of options, but eventually they all come down to a copy followed by a delete. That serves the purpose, but the code below moves the files directly, which saves time: instead of performing two activities (copy and delete), it just moves the files.

    This can be done with a simple piece of Python in Databricks:

        import os

        # Keep only files: directory paths returned by dbutils.fs.ls end
        # with "/", so their basename is empty and they are filtered out
        file_list = [f.path for f in dbutils.fs.ls("abfss://<container>@<storage>.dfs.core.windows.net/<Folder1 path>/")
                     if os.path.basename(f.path)]
        print(file_list)

        cnt = 0
        for f in file_list:
            cnt += 1
            # mv moves the file in a single call instead of a copy plus a delete
            dbutils.fs.mv(f, "abfss://<container>@<storage>.dfs.core.windows.net/<Folder2 path>/")

        print(f"Moved {cnt} files")
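
    A small variant of the same idea, as a sketch: appending the source file name to the destination pins each file's target path explicitly, instead of relying on the directory-move behaviour of dbutils.fs.mv (for example, when <Folder2 path> does not exist yet):

        import os

        src = "abfss://<container>@<storage>.dfs.core.windows.net/<Folder1 path>/"
        dst = "abfss://<container>@<storage>.dfs.core.windows.net/<Folder2 path>/"

        for f in dbutils.fs.ls(src):
            if os.path.basename(f.path):  # skip sub-directories
                dbutils.fs.mv(f.path, dst + os.path.basename(f.path))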