Tags: php, filesystems, race-condition, fileserver

How to handle file server race conditions?


I am developing an application that polls a folder on a network file server (CIFS) for new files via a scheduled cron job that runs every minute.

When it sees a new file, it copies it temporarily to the local file system, does various things with it, and then deletes it from both the local and network file systems.

I have concerns about the possibility of a race condition where my app polls the network folder at the same time that somebody is adding a file to it. The files are very small (about 1 KB), so it should be rare for a file to still be copying when I poll the folder, but it could happen.

My question: is this a legitimate concern, and if so, how should I handle it?


Solution

  • Here is how I solved my problems.

    Note that I had this problem in two areas of my workflow. The first was that my application had to monitor new files in a directory and ensure they had finished transferring. The second was that I had to upload files to a directory monitored by another piece of software that a) I have no control over, and b) is very archaic and does no verification of its own.

    To solve the first problem:

    In my scheduled job, I scan the directory for all files, generate an MD5 hash of each file, and save the hash to a table in the database along with the file path.

    The next time my scheduled job runs (one minute later), I grab all of the rows out of the database (file path and hash), check whether each file still exists, and generate an MD5 hash of the file again. If the file exists and the hash is unchanged, I do my processing on the file (and remove it from the directory). If either check fails, I simply skip to the next file in the loop.

    After all of the files are processed, I truncate the table that indexed the files, then reindex the files currently in the directory and save them back to the database. A minute later the job starts over, consuming the files from the new index.

    This way I'm never working with files that I didn't index on the previous job run. I believe it's safe to assume that if a file's hash hasn't changed over the course of a minute, the file has finished transferring and I can consume it. A rough sketch of this job is shown below.
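
    The following is a minimal sketch of that two-phase job, not my exact code. It assumes a PDO connection, a hypothetical table `file_index (path, hash)`, a hypothetical `processFile()` helper standing in for the real work, and placeholder paths.

    ```php
    <?php
    // Sketch of the "index on one run, verify and consume on the next" job.
    // Assumed table: CREATE TABLE file_index (path VARCHAR(255) PRIMARY KEY, hash CHAR(32));

    $watchDir = '/mnt/fileserver/incoming';           // CIFS mount being polled (placeholder)
    $pdo = new PDO('sqlite:/var/lib/myapp/index.db'); // any PDO driver would do

    // 1) Consume files that were indexed on the previous run and are unchanged.
    $rows = $pdo->query('SELECT path, hash FROM file_index')->fetchAll(PDO::FETCH_ASSOC);
    foreach ($rows as $row) {
        $path = $row['path'];
        if (!is_file($path)) {
            continue;                         // file vanished since last run: skip it
        }
        if (md5_file($path) !== $row['hash']) {
            continue;                         // hash changed, likely still transferring: skip it
        }
        processFile($path);                   // placeholder for the real processing
        unlink($path);                        // remove the consumed file
    }

    // 2) Rebuild the index so the next run has a fresh baseline to compare against.
    $pdo->exec('DELETE FROM file_index');
    $insert = $pdo->prepare('INSERT INTO file_index (path, hash) VALUES (?, ?)');
    foreach (glob($watchDir . '/*') as $path) {
        if (is_file($path)) {
            $insert->execute([$path, md5_file($path)]);
        }
    }

    function processFile(string $path): void
    {
        // copy to local temp storage, do the actual work, etc.
    }
    ```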

    To solve the second problem:

    To ensure that the other piece of software wouldn't consume a file that I might be in the middle of uploading, I simply created another directory on the server that the software wasn't monitoring and uploaded the files there. Once a file had finished transferring, I issued a move command to move it into the monitored directory; because a move (rename) within the same file system is an atomic operation, it is safe from the race condition. A sketch of this is shown below as well.
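
    Here is a minimal sketch of that staging-directory approach, assuming both directories live on the same mounted share so that PHP's rename() is a single atomic operation; the directory and file names are placeholders.

    ```php
    <?php
    // Upload to an unmonitored staging directory, then atomically move into place.

    $stagingDir   = '/mnt/fileserver/staging'; // not watched by the other software (placeholder)
    $monitoredDir = '/mnt/fileserver/inbox';   // watched by the other software (placeholder)
    $fileName     = 'report-1234.xml';         // placeholder file

    // 1) Write the file into the staging directory first. A watcher polling the
    //    monitored directory never sees this partially written file.
    $stagingPath = $stagingDir . '/' . $fileName;
    copy('/tmp/' . $fileName, $stagingPath);

    // 2) Move it into the monitored directory in one step. Within the same file
    //    system, rename() either fully succeeds or fails, so the watcher can
    //    never observe a half-transferred file.
    if (!rename($stagingPath, $monitoredDir . '/' . $fileName)) {
        throw new RuntimeException('Failed to move file into monitored directory');
    }
    ```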