I want to build a script which finds out which files on an FTP server are new and which are already processed.
For each file on the FTP we read out the information, parse it and write the information we need from it to our database. The files are xml-files, but have to be translated.
At the moment I'm using mlsd()
to get a list, but this takes up to 4 minutes because there are already 15.000 files in this directory - it will be more everyday.
Instead of comparing this list with an older list which I saved in a textfile I would like to know if there are better possibilities.
Because this task has to run "live" it would end in an cronjob every 1 or 2 minutes. If this method takes to long this won't work.
The solution should be either in PHP or Python.
def handle(self, *args, **options):
ftp = FTP_TLS(host=host)
ftp.login(user,passwd)
ftp.prot_p()
list = ftp.mlsd("...")
for item in list:
print(item[0] + " => " + item[1]['modify'])
This code examples already runs 4 minutes.
I have always tried to avoid browsing a folder to find what could have changed. I prefered setting a dedicated workflow. When files can only be added (or new versions of existing files), I tried to use a workflow where files are added in one directory and then go in other directories where they are archived. Processing can occur in a directory where files are deleted after being used, or when they are copied/moved from a folder to an other one.
As a slight goody, I also use a copy/rename pattern: the files are first copied using a temporary name (for example a .t
prefix or suffix) and renamed when the copy has ended. This prevents trying to process a file which is not fully copied. Ok it used to be more important when we had slow lines, but race conditions should be avoided as much as possible, and it allows to use daemon which polls a folder every 10 seconds or less.
Unsure whether it is really relevant here because it could require some refactoring, but it gives bullet proof solutions.