celery, filepath, distributed-computing

How to handle file paths in distributed environment


I'm setting up a distributed Celery environment to run OCR on PDF files. I have about 3M PDFs, and OCR is CPU-bound, so the idea is to create a cluster of servers to process them.

As I'm writing my task, I've got something like this:

@app.task
def do_ocr(pk, file_path):
    # Run OCR on the PDF and store the extracted text on the Document row.
    content = run_tesseract_command(file_path)
    item = Document.objects.get(pk=pk)
    item.content = content
    item.save()

My question is: what is the best way to make file_path work in a distributed environment? How do people usually handle this? Right now all my files live in a single directory on one of our servers.


Solution

  • If you are in a Linux environment, the easiest way is to mount each remote filesystem with sshfs under /mnt on every node in the cluster. Then you can pass the node name to the do_ocr function and work as if all the data were local to the current node.

    For example, say your cluster has N nodes named node1, ..., nodeN.
    Let's configure node1: mount the remote filesystem of every other node. Here's a sample /etc/fstab for node1:

    sshfs#user@node2:/var/your/app/pdfs    /mnt/node2 fuse    port=<port>,defaults,user,noauto,uid=1000,gid=1000        0       0
    ....
    sshfs#user@nodeN:/var/your/app/pdfs    /mnt/nodeN fuse    port=<port>,defaults,user,noauto,uid=1000,gid=1000        0       0
    

    On the current node (node1), create a symlink in /mnt, named after the node itself, pointing to the local PDF directory:

    ln -s /var/your/app/pdfs /mnt/node1
    

    Your /mnt folder should now contain the remote filesystems and the symlink:

    user@node1:/mnt$ ls -lsa
    0 lrwxrwxrwx  1 user user      16 apr 12  2016 node1 -> /var/your/app/pdfs
    0 lrwxrwxrwx  1 user user      16 apr 12  2016 node2
    ...
    0 lrwxrwxrwx  1 user user      16 apr 12  2016 nodeN
    

    Then your function should look like this:

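    import os

    MOUNT_POINT = '/mnt'

    @app.task
    def do_ocr(pk, node_name, file_path):
        # file_path must be relative to the node's PDF directory;
        # os.path.join would discard the prefix if it were absolute.
        full_path = os.path.join(MOUNT_POINT, node_name, file_path)
        content = run_tesseract_command(full_path)
        item = Document.objects.get(pk=pk)
        item.content = content
        item.save()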
    

    It works as if all the files were on the current machine, while the remote access is handled transparently by the mounted filesystems.
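
One thing to watch with this layout is that the path is built from task arguments, and the fstab entries use noauto, so a share may not be mounted when a task runs. Here is a minimal sketch of how a worker could guard against both. The names resolve_pdf_path and node_is_mounted are hypothetical helpers made up for illustration (they are not part of Celery or sshfs), and the /mnt/<node> layout is assumed to match the setup above.

```python
import os
import posixpath
from pathlib import PurePosixPath

MOUNT_POINT = "/mnt"  # assumed to be the same mount root as in the setup above


def resolve_pdf_path(node_name, file_path, mount_point=MOUNT_POINT):
    """Build <mount_point>/<node_name>/<file_path>, rejecting unsafe input.

    Hypothetical helper for illustration only.
    """
    rel = PurePosixPath(file_path)
    # A path join drops earlier components when a later one is absolute,
    # so an absolute file_path would bypass the mount point entirely.
    if rel.is_absolute():
        raise ValueError(f"file_path must be relative: {file_path!r}")
    # An empty path (or '.') would resolve to the node directory itself.
    if not rel.parts:
        raise ValueError("file_path must not be empty")
    # Refuse '..' so a task argument cannot escape the node's directory.
    if ".." in rel.parts:
        raise ValueError(f"file_path must not contain '..': {file_path!r}")
    # The node name must be a single directory name such as 'node2'.
    if not node_name or "/" in node_name or node_name in (".", ".."):
        raise ValueError(f"invalid node name: {node_name!r}")
    return posixpath.join(mount_point, node_name, *rel.parts)


def node_is_mounted(node_name, mount_point=MOUNT_POINT):
    """Return True if the node's directory is usable on this machine.

    The local node is a symlink to the real PDF directory, while every
    other node is expected to be an active mount point.
    """
    path = os.path.join(mount_point, node_name)
    return os.path.islink(path) or os.path.ismount(path)
```

Inside do_ocr you could call resolve_pdf_path(node_name, file_path) in place of the bare os.path.join, and skip or retry the task (for example with self.retry on a task declared with bind=True) when node_is_mounted(node_name) returns False.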