
Using ZooKeeper to manage tasks which are in process or have been processed


I have a Python script which periodically scans directories, processing new files. Each file takes a long time to process (many hours). I currently have the script running on a single computer, writing the names of processed files to a local file. Not fancy or robust, but it more or less works. I would like to use multiple worker machines to improve throughput (and robustness). My goal is to keep it as simple as possible. A ZooKeeper cluster is readily available.

My plan is to have in ZooKeeper a directory "started_files" containing ephemeral znodes named after the file (the filename is known to be unique), and another directory "completed_files" containing regular (persistent) znodes with the same names. In pseudocode,

if filename does not exist in completed files:
    try:
        create ephemeral node filename in started files
        process(filename)
        create node filename in completed files
    except node exists error:
        do nothing, another worker is processing it
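Concretely, that maps onto the kazoo client roughly as follows (a sketch, assuming kazoo is available; the host list is a placeholder and process() is the existing processing function):

from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # placeholder ensemble
zk.start()

def claim_and_process(filename):
    if zk.exists("/completed_files/" + filename):
        return  # already done
    try:
        # Atomically claim the file; fails if another worker got here first.
        zk.create("/started_files/" + filename, ephemeral=True, makepath=True)
        process(filename)
        zk.create("/completed_files/" + filename, makepath=True)
    except NodeExistsError:
        pass  # do nothing, another worker is processing it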

My first question is whether or not this is safe: under any circumstance, can two different machines each create the same node successfully? I don't fully understand the docs. Having a file processed twice won't cause anything all that bad, but I would prefer it to be correct on principle.

Secondly, is this a decent approach? Is there another approach which is clearly better? I will be processing tens of files per day, so performance of this part of the application doesn't really matter to me (I sure wish processing the files was faster). Alternatively, I could have another script with just a single instance (or elect a leader) to scan for files and put them in a queue. I could modify the code which is causing these files to magically appear in the first place. I could use Celery or Storm. However, all of those alternatives grow the scope, which I am trying to keep small and simple.


Solution

In general your approach should work. Creating a znode in ZooKeeper is atomic: if two clients race to create the same path, exactly one create succeeds and the other fails with a "node exists" error, so a second creation of an existing path will always fail.
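A quick way to convince yourself (a sketch reusing the connected zk client from above; the path is arbitrary):

zk.create("/demo/claim", makepath=True)   # first creation succeeds
zk.create("/demo/claim", makepath=True)   # second raises NodeExistsError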

For the ephemeral znodes you already found out quite well that they vanish automatically when the owning client's session ends, which is especially useful in the case of failing compute nodes: a dead worker's claim in "started_files" disappears on its own, so the file becomes eligible for processing again.
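One caveat worth making visible in the worker (a sketch, not from the answer: kazoo invokes a registered listener on session state changes, so a worker can notice that its claims may have expired mid-processing):

from kazoo.client import KazooState

def on_state_change(state):
    # Called by kazoo from its own thread on connection state changes.
    if state == KazooState.LOST:
        print("session lost: our ephemeral claims are gone, another worker may take over")
    elif state == KazooState.SUSPENDED:
        print("disconnected: claims will expire if this persists")

zk.add_listener(on_state_change)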

Other nodes can also set a watch on the path holding the ephemeral znodes in order to figure out when it would be a good idea to scan for new tasks.
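With kazoo this is nearly a one-liner via a children watch (a sketch; scan_for_new_tasks is a hypothetical callback standing in for your directory scan):

@zk.ChildrenWatch("/started_files")
def on_claims_changed(children):
    # Re-invoked whenever a claim znode appears or disappears.
    scan_for_new_tasks()  # hypothetical: look for files neither claimed nor completed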

It would even be possible to implement a queue on top of ZooKeeper, for instance using sequential znodes, though there may be better ways.
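kazoo in fact ships a recipe built on exactly that idea; a minimal sketch with its LockingQueue (the queue path is a placeholder):

from kazoo.recipe.queue import LockingQueue

queue = LockingQueue(zk, "/file_queue")

# Producer side: a single scanner (or elected leader) enqueues new filenames.
queue.put(b"new_file.dat")

# Consumer side: a worker takes an item, processes it, then confirms.
filename = queue.get().decode()  # blocks until an item is available
process(filename)
queue.consume()  # removed only after success; a crashed worker's item is released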

In general I believe that a message queue system with a publish/subscribe pattern would scale a bit better. In that case you would only need to think about how to reschedule the jobs of failed compute nodes.
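If you do go that way, the Celery option mentioned in the question is roughly this much code (a sketch; the broker URL is a placeholder):

from celery import Celery

app = Celery("file_tasks", broker="redis://localhost:6379/0")  # placeholder broker

@app.task(acks_late=True)  # acknowledge after completion, so a dead worker's task is redelivered
def process_file(filename):
    process(filename)  # the existing long-running processing step

# The scanner enqueues work; any number of workers consume it.
process_file.delay("new_file.dat")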