I'm working on setting up a distributed celery environment to do OCR on PDF files. I have about 3M PDFs and OCR is CPU-bound so the idea is to create a cluster of servers to process the OCR.
As I'm writing my task, I've got something like this:
@app.task
def do_ocr(pk, file_path):
content = run_tesseract_command(file_path)
item = Document.objects.get(pk=pk)
item.content = ocr_content
item.save()
The question I have what the best way is to make the file_path
work in a distributed environment. How do people usually handle this? Right now all my files simply live in a simple directory on one of our servers.
If your are in linux environment the easiest way is mount a remote filesystem, using sshfs, in the /mnt
folder foreach node in cluster. Then you can pass the node name to do_ocr
function and work as all data is local to current node
For example, your cluster has N nodes named: node1, ... ,nodeN
Let's configure node1, foreach node mount remote filesystem. Here's a sample node1's /etc/fstab file
sshfs#user@node2:/var/your/app/pdfs /mnt/node2 fuse port=<port>,defaults,user,noauto,uid=1000,gid=1000 0 0
....
sshfs#user@nodeN:/var/your/app/pdfs /mnt/nodeN fuse port=<port>,defaults,user,noauto,uid=1000,gid=1000 0 0
In current node (node1) create a symlink named as current server pointing to pdf's path
ln -s /var/your/app/pdfs node1
Your mnt folder should contain remote's filesystem and a symlink
user@node1:/mnt$ ls -lsa
0 lrwxrwxrwx 1 user user 16 apr 12 2016 node1 -> /var/your/app/pdfs
0 lrwxrwxrwx 1 user user 16 apr 12 2016 node2
...
0 lrwxrwxrwx 1 user user 16 apr 12 2016 nodeN
Then your function should look like this:
import os
MOUNT_POINT = '/mtn'
@app.task
def do_ocr(pk, node_name, file_path):
content = run_tesseract_command(os.path.join(MOUNT_POINT,node_name,file_path))
item = Document.objects.get(pk=pk)
item.content = ocr_content
item.save()
It works like all files are in the current machine but there's remote-logic working for you transparently