How to handle file paths in a distributed environment

I'm working on setting up a distributed Celery environment to do OCR on PDF files. I have about 3M PDFs and OCR is CPU-bound, so the idea is to create a cluster of servers to process the OCR.

As I'm writing my task, I've got something like this:

@app.task
def do_ocr(pk, file_path):
    content = run_tesseract_command(file_path)
    item = Document.objects.get(pk=pk)
    item.content = content
    item.save()

My question is: what is the best way to make file_path work in a distributed environment? How do people usually handle this? Right now all my files live in a single directory on one of our servers.

Solution

If you are in a Linux environment, the easiest way is to mount each remote filesystem with sshfs under /mnt on every node in the cluster. Then you can pass the node name to the do_ocr function and work as if all the data were local to the current node.

For example, say your cluster has N nodes named node1, ..., nodeN.
Let's configure node1 by mounting the remote filesystem of every other node. Here is a sample /etc/fstab for node1:

sshfs#user@node2:/var/your/app/pdfs    /mnt/node2 fuse    port=<port>,defaults,user,noauto,uid=1000,gid=1000        0       0
....
sshfs#user@nodeN:/var/your/app/pdfs    /mnt/nodeN fuse    port=<port>,defaults,user,noauto,uid=1000,gid=1000        0       0
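
Since every node needs the same set of entries minus its own, the lines can be generated instead of written by hand. The following is only a sketch: fstab_lines is a hypothetical helper, and the user name, port, uid/gid and remote directory are placeholders taken from the sample above.

```python
def fstab_lines(current_node, nodes, user, port,
                remote_dir='/var/your/app/pdfs'):
    """Build one sshfs fstab entry per remote node (hypothetical helper).

    The current node is skipped because its files are local.
    """
    template = (
        'sshfs#{user}@{node}:{remote_dir}    /mnt/{node} fuse    '
        'port={port},defaults,user,noauto,uid=1000,gid=1000        0       0'
    )
    return [
        template.format(user=user, node=node, remote_dir=remote_dir, port=port)
        for node in nodes
        if node != current_node
    ]


if __name__ == '__main__':
    # Example: print the entries node1 would need in a three-node cluster.
    for line in fstab_lines('node1', ['node1', 'node2', 'node3'], 'user', 22):
        print(line)
```

Because the entries use noauto, the filesystems are not mounted at boot and each one still has to be mounted explicitly.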

On the current node (node1), create a symlink in /mnt, named after the current server, that points to the local PDF directory:

ln -s /var/your/app/pdfs /mnt/node1

Your /mnt folder should now contain the remote filesystems and the symlink:

user@node1:/mnt$ ls -lsa
0 lrwxrwxrwx  1 user user      16 apr 12  2016 node1 -> /var/your/app/pdfs
0 lrwxrwxrwx  1 user user      16 apr 12  2016 node2
...
0 lrwxrwxrwx  1 user user      16 apr 12  2016 nodeN

Then your function should look like this:

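import os

MOUNT_POINT = '/mnt'  # directory that holds one entry per node

@app.task
def do_ocr(pk, node_name, file_path):
    # file_path must be relative: os.path.join drops MOUNT_POINT if it is absolute
    full_path = os.path.join(MOUNT_POINT, node_name, file_path)
    content = run_tesseract_command(full_path)
    item = Document.objects.get(pk=pk)
    item.content = content
    item.save()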

It works as if all the files were on the current machine, while the remote access happens transparently.
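
One caveat with building the path this way: if file_path is absolute or contains "..", the joined result can point outside the node's directory. Here is a minimal sketch of a guard, assuming file_path is stored relative to the pdfs directory; resolve_pdf_path is a hypothetical helper name, not something Celery provides.

```python
import os

MOUNT_POINT = '/mnt'  # assumed: one mount or symlink per node lives here


def resolve_pdf_path(node_name, file_path, mount_point=MOUNT_POINT):
    """Return the path of file_path under <mount_point>/<node_name>.

    Hypothetical helper. Raises ValueError if the result would fall
    outside that node's directory (absolute path or '..' components).
    """
    base = os.path.normpath(os.path.join(mount_point, node_name))
    full = os.path.normpath(os.path.join(base, file_path))
    # os.path.join discards earlier parts when a later part is absolute,
    # so verify that the result is still inside the node's directory.
    if os.path.commonpath([base, full]) != base:
        raise ValueError('path escapes mount point: %r' % file_path)
    return full
```

Inside do_ocr, the call would then be run_tesseract_command(resolve_pdf_path(node_name, file_path)).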