I'm currently using ftplib
in Python to get some files and write them to S3.
The approach I'm using is a with open block, as shown below:
with open(filename, 'wb') as fp:
    ftp.retrbinary('RETR ' + filename, fp.write)
to download files from the FTP server and save them in a temporary folder, then upload them to S3.
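In full, the loop currently looks roughly like this (I use boto3 for the upload; the host, credentials, bucket name and staging folder below are just placeholders):

import os
import ftplib
import boto3

STAGING = '/tmp/ftp-staging'                               # placeholder temp folder
os.makedirs(STAGING, exist_ok=True)

ftp = ftplib.FTP('ftp.example.com', 'user', 'password')    # placeholder FTP details
ftp.cwd('/remote-dir')                                     # placeholder remote folder
s3 = boto3.client('s3')

for filename in ftp.nlst():
    local_path = os.path.join(STAGING, filename)
    with open(local_path, 'wb') as fp:
        ftp.retrbinary('RETR ' + filename, fp.write)
    s3.upload_file(local_path, 'my-bucket', filename)      # placeholder bucket
    os.remove(local_path)                                   # free the temp space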
I wonder if this is best practice, because this approach has a shortcoming: if the files are numerous and big, I can download them, upload them to S3, and then delete them from the temp folder, but if I run this script once a day I have to download everything again. How can I check whether a file has already been downloaded and exists in S3, so that the script only processes files newly added to the FTP server?
Hope this makes sense. It would be great if anyone has an example or something. Many thanks.
Cache the fact that you have processed a given file path in persistent storage (say, a SQLite database).

If a file may change after you have processed it, you may be able to detect that by also caching its timestamp from FTP.dir() and/or its size from FTP.size(filename). If that doesn't work, also cache a checksum (say, SHA-256) of the file; then, to see whether it changed, download the file again and recalculate the checksum.

S3 may also support a conditional upload based on the ETag, in which case you would calculate the ETag of the file and upload with that header set, ideally together with an 'Expect: 100-continue' header, so the server can tell you it already has the file before you send the data.
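A minimal sketch of the SQLite bookkeeping, using a made-up processed(path, size, mtime) table (names are only illustrative):

import sqlite3

db = sqlite3.connect('processed.sqlite3')
db.execute('CREATE TABLE IF NOT EXISTS processed ('
           'path TEXT PRIMARY KEY, size INTEGER, mtime TEXT)')

def already_processed(path, size, mtime):
    # A path only counts as processed if its size and timestamp still match.
    row = db.execute('SELECT size, mtime FROM processed WHERE path = ?',
                     (path,)).fetchone()
    return row is not None and row == (size, mtime)

def mark_processed(path, size, mtime):
    db.execute('INSERT OR REPLACE INTO processed VALUES (?, ?, ?)',
               (path, size, mtime))
    db.commit()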
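For the size/timestamp part, something along these lines could work. LIST output is not standardized, so the listing line is treated as an opaque value to compare between runs (if your server supports MLSD you can get a real modification time instead):

def remote_metadata(ftp, filename):
    # ftp is a connected ftplib.FTP instance; SIZE usually needs binary mode.
    ftp.voidcmd('TYPE I')
    size = ftp.size(filename)
    # FTP.dir() produces human-readable LIST lines whose format varies by
    # server, so keep the line as-is and just check whether it changed.
    lines = []
    ftp.dir(filename, lines.append)
    listing = lines[0] if lines else ''
    return size, listing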
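Hashing while downloading keeps it to a single pass; a sketch. Compare the returned digest to the one stored in your cache and skip the S3 upload when it matches:

import hashlib

def download_with_sha256(ftp, filename, local_path):
    # Hash the file while writing it, so re-downloading is the only extra
    # cost when you want to check whether the remote file changed.
    digest = hashlib.sha256()
    with open(local_path, 'wb') as fp:
        def handle_chunk(chunk):
            fp.write(chunk)
            digest.update(chunk)
        ftp.retrbinary('RETR ' + filename, handle_chunk)
    return digest.hexdigest()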
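I am not sure boto3 exposes a true conditional upload, but since the ETag of a plain single-part object is its MD5, you can do the comparison yourself with head_object before uploading. A sketch; note the assumption breaks for multipart or SSE-KMS uploads, and bucket/key are placeholders:

import hashlib
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')

def needs_upload(local_path, bucket, key):
    # For a plain single-part upload the S3 ETag is the hex MD5 of the object.
    md5 = hashlib.md5()
    with open(local_path, 'rb') as fp:
        for chunk in iter(lambda: fp.read(1024 * 1024), b''):
            md5.update(chunk)
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
    except ClientError:
        return True   # no such key yet (or it is not visible to us)
    return head['ETag'].strip('"') != md5.hexdigest()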