I'm trying to recursively move files from an SFTP server to S3, possibly using boto3. I want to preserve the folder/file structure as well. I was looking to do it this way:
import pysftp
private_key = "/mnt/results/sftpkey"
srv = pysftp.Connection(host="server.com", username="user1", private_key=private_key)
srv.get_r("/mnt/folder", "./output_folder")
Then take those files and upload them to S3 using boto3.
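Something like this for the upload step (just a rough sketch; the bucket name and local folder here are placeholders):
import os
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"  # placeholder bucket name
local_root = "./output_folder"

# Walk the downloaded tree and mirror it into S3,
# using each file's path relative to local_root as its object key
for root, dirs, files in os.walk(local_root):
    for name in files:
        local_path = os.path.join(root, name)
        key = os.path.relpath(local_path, local_root).replace(os.sep, "/")
        s3.upload_file(local_path, bucket, key)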
However, the folders and files on the server are numerous, deeply nested, and large, so my machine ends up running out of memory and disk space. I was thinking of a script that downloads a single file, uploads it to S3, deletes it locally, and repeats.
I know this would take a long time to finish, but I could run it as a job without running out of space and without keeping my machine open the entire time. Has anyone done something similar? Any help is appreciated!
If you can't (or don't want to) download all of the files at once before sending them to S3, then you need to download them one at a time.
From there, it follows that you'll need to build a list of files to download, then work through that list, transferring each file to your local machine and then sending it to S3.
A very simple version of this would look something like this:
import pysftp
import stat
import boto3
import os
import json
# S3 bucket and prefix to upload to
target_bucket = "example-bucket"
target_prefix = ""
# Root FTP folder to sync
base_path = "./"
# Both base_path and target_prefix should end in a "/"
# Or, for the prefix, be empty for the root of the bucket
srv = pysftp.Connection(
    host="server.com",
    username="user1",
    private_key="/mnt/results/sftpkey",
)

if os.path.isfile("all_files.json"):
    # No need to cache files more than once. This lets us restart
    # on a failure, though really we should be caching files in
    # something more robust than just a json file
    with open("all_files.json") as f:
        all_files = json.load(f)
else:
    # No local cache, go ahead and get the files
    print("Need to get list of files...")
    todo = [(base_path, target_prefix)]
    all_files = []
    while len(todo):
        cur_dir, cur_prefix = todo.pop(0)
        print("Listing " + cur_dir)
        for cur in srv.listdir_attr(cur_dir):
            if stat.S_ISDIR(cur.st_mode):
                # A directory, so walk into it
                todo.append((cur_dir + cur.filename + "/", cur_prefix + cur.filename + "/"))
            else:
                # A file, just add it to our cache
                all_files.append([cur_dir + cur.filename, cur_prefix + cur.filename])
    # Save the cache out to disk
    with open("all_files.json", "w") as f:
        json.dump(all_files, f)

# And now, for every file in the cache, download it
# and turn around and upload it to S3
s3 = boto3.client('s3')
while len(all_files):
    ftp_file, s3_name = all_files.pop(0)
    print("Downloading " + ftp_file)
    srv.get(ftp_file, "_temp_")
    print("Uploading " + s3_name)
    s3.upload_file("_temp_", target_bucket, s3_name)
    # Clean up, and update the cache with one less file
    os.unlink("_temp_")
    with open("all_files.json", "w") as f:
        json.dump(all_files, f)

srv.close()
Error checking and speed improvements are obviously possible.
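One such improvement, sketched below (untested, and assuming the same srv, s3, target_bucket, and all_files as above), is to stream each file straight from the SFTP server to S3 with upload_fileobj so nothing is staged on the local disk:
while len(all_files):
    ftp_file, s3_name = all_files.pop(0)
    print("Streaming " + ftp_file)
    # pysftp's open() returns a paramiko SFTPFile, which is file-like,
    # so boto3 can read from it directly
    with srv.open(ftp_file, "rb") as remote:
        s3.upload_fileobj(remote, target_bucket, s3_name)
    # Update the cache with one less file
    with open("all_files.json", "w") as f:
        json.dump(all_files, f)
The trade-off is that a failed transfer has to be retried from the start of that file, but you skip writing and deleting _temp_ on every iteration.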