python, download, sftp, paramiko, pysftp

Download multiple files in different SFTP directories to local


I have a scenario where I need to download certain image files from different directories on an SFTP server to a local machine.

Example:
/IMAGES/folder1 has img11, img12, img13, img14
/IMAGES/folder2 has img21, img22, img23, img24
/IMAGES/folder3 has img31, img32, img33, img34
And I need to download img12, img23 and img34 from folder1, folder2 and folder3 respectively.

Right now I go into each folder and get the images individually, which takes an extraordinary amount of time (there are tens of thousands of images to download).

I have also found that downloading a single file of the same total size (as the many image files combined) takes a fraction of the time.

My question is: is there a way to get these files together instead of downloading them one after another?

One approach I came up with was to copy all the files into a temp folder on the SFTP server and then download that directory in one go, but SFTP does not support 'copy', and I cannot use 'rename' because that would move the files into the temp directory.


Solution

  • You could use a process pool to open multiple SFTP connections and download files in parallel. For example:

    from paramiko import SSHClient
    from multiprocessing import Pool
    
    def download_init(host):
        # Each pool process opens its own SSH connection and SFTP session.
        global client, sftp
        client = SSHClient()
        client.load_system_host_keys()
        client.connect(host)
        sftp = client.open_sftp()
    
    def download_close(dummy):
        client.close()
    
    def download_worker(params):
        local_path, remote_path = params
        sftp.get(remote_path, local_path)
    
    list_of_local_and_remote_files = [
        ["/client/files/folder1/img11", "/IMAGES/folder1/img11"],
        # ... one [local_path, remote_path] pair per file to download
    ]
    
    def downloader(files):
        pool_size = 8
        pool = Pool(pool_size, initializer=download_init,
            initargs=["sftpserver.example.com"])
        result = pool.map(download_worker, files, chunksize=10)
        # One close item per pool process, so each connection is closed once.
        pool.map(download_close, range(pool_size))
    
    if __name__ == "__main__":
        downloader(list_of_local_and_remote_files)
    
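    The file list above shows only a single [local, remote] pair; with tens of thousands of images you would build it programmatically. Here is a minimal sketch of one way to do that (the folder-to-filename mapping and the two root paths are hypothetical placeholders, not part of the answer above):

    import os

    def build_file_list(wanted, local_root="/client/files", remote_root="/IMAGES"):
        # wanted maps a remote folder name to the image names to fetch from it,
        # e.g. {"folder1": ["img12"], "folder2": ["img23"], "folder3": ["img34"]}
        pairs = []
        for folder, names in wanted.items():
            for name in names:
                pairs.append([os.path.join(local_root, folder, name),
                              f"{remote_root}/{folder}/{name}"])  # remote side always uses "/"
        return pairs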

    It's unfortunate that Pool doesn't have a finalizer to undo what was set up in the initializer. It's not usually necessary - the exiting process is cleanup enough. In the example I just wrote a separate worker function that cleans things up; by having one work item per pool process, each process gets exactly one close call.
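
    If full processes feel heavy for what is network-bound (not CPU-bound) work, the same fan-out pattern can be sketched with threads instead, opening one connection per thread via threading.local. This is an alternative sketch under the same assumptions (placeholder host name, and the same file list as above), not part of the original answer:

    import threading
    from concurrent.futures import ThreadPoolExecutor
    from paramiko import SSHClient

    thread_data = threading.local()

    def get_sftp(host):
        # Lazily open one SSH connection and SFTP session per thread, then reuse it.
        if not hasattr(thread_data, "sftp"):
            client = SSHClient()
            client.load_system_host_keys()
            client.connect(host)
            thread_data.client = client  # keep a reference so the connection stays open
            thread_data.sftp = client.open_sftp()
        return thread_data.sftp

    def download(params):
        local_path, remote_path = params
        get_sftp("sftpserver.example.com").get(remote_path, local_path)

    with ThreadPoolExecutor(max_workers=8) as executor:
        # list() drains the iterator so any worker exception is raised here
        list(executor.map(download, list_of_local_and_remote_files))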