Tags: python, pyspark, azure-databricks, azure-data-lake

Copy files from one Azure storage account to another using PySpark


I'm trying to copy files whose names match certain criteria from one Azure storage account (all in Data Lake Storage) to another. I'm currently trying to do this using PySpark. I list out the folders I want to look at, then set up Spark for the "from" data lake and use dbutils to get the files in the relevant folders:

spark.conf.set("fs.azure.account.key."+dev_storage_account_name+".dfs.core.windows.net",dev_storage_account_access_key)

for folder in raw_folders:
    list_of_files = dbutils.fs.ls("abfss://mycontainer@mydatalake.dfs.core.windows.net/" + folder)

Now I can check whether file names match the conditions to copy, but how do I go about actually moving my list of desired files to folders in my "to" datalake?


Solution

  • You will need to mount both containers, and then use dbutils.fs.mv as shown below to move files across filesystems:

    Inside your loop, for each iteration, replace the file name in the path below with the corresponding item from list_of_files:

    dbutils.fs.mv(
        'abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>/demo/test.csv',
        'abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>/destination/renamedtest.csv'
    )
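
    Putting that together with the question's loop, a minimal sketch might look like the following. This assumes both account keys have already been set via spark.conf.set as in the question; the destination account/container names and the ".csv" filter are placeholders for your own values:

    source_root = "abfss://mycontainer@mydatalake.dfs.core.windows.net/"
    dest_root = "abfss://mycontainer@mydestdatalake.dfs.core.windows.net/"  # hypothetical "to" data lake

    for folder in raw_folders:
        for file_info in dbutils.fs.ls(source_root + folder):
            # file_info.name is the bare file name; file_info.path is the full abfss:// path
            if file_info.name.endswith(".csv"):  # stand-in for your real matching criteria
                dbutils.fs.mv(file_info.path, dest_root + folder + "/" + file_info.name)

    Since the question is about copying rather than moving, dbutils.fs.cp can be used in place of dbutils.fs.mv to leave the source files untouched.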
    

    Also....

    If the containers are not public (or if you are working at the root folder), use the DBFS CLI to move files/folders between the mount points created beforehand (a sketch of creating such mounts is at the end of this answer):

    dbfs mv dbfs:/mnt/folder1 dbfs:/mnt/folder2
    

    If the access level of the containers is "anonymous read access for containers and blobs", you should be able to move files directly without even creating mounts.

    In a Databricks notebook, the code would look something like this:

    %fs mv /mnt/folder1 /mnt/folder2
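
    For completeness, the mount points referenced above could be created with dbutils.fs.mount. The snippet below is only a sketch that assumes account-key authentication; the container name, storage account name, access key, and mount point are placeholders, and OAuth with a service principal is another common option:

    # Hypothetical example: mount an ADLS Gen2 container using its account key.
    # Repeat with the second container/storage account (e.g. as /mnt/folder2).
    dbutils.fs.mount(
        source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
        mount_point="/mnt/folder1",
        extra_configs={
            "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net":
                "<storage-account-access-key>"
        }
    )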