
Paperless-ngx redo OCR for documents


I'm trying to redo the OCR for my documents in Paperless-ngx, because some obvious text in the PDFs is missing or was not indexed automatically. How can I redo OCR for specific documents?

I'm using the docker installation so I have the following containers running:

paperless-webserver-1
paperless-broker-1
paperless-db-1
paperless-gotenberg-1
paperless-tika-1

I have found the following discussion on the GitHub page, but it doesn't explain how to actually do it; it only says "implemented".

Their documentation also mentions PAPERLESS_OCR_MODE=<mode>, but again there is no example, and I couldn't find where to apply the setting.

Thank you :)


Solution

  • You can force OCR to run again for a single document with this command, replacing the placeholder with the document's ID:

    docker exec -d -e "PAPERLESS_OCR_MODE=force" paperless-webserver-1 document_archiver --overwrite --document [HERE_COMES_THE_DOCUMENT_ID]
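
If several documents need to be redone, the same command can be wrapped in a small loop. This is only a sketch: the function name `redo_ocr` and the `DOCKER` and `CONTAINER` variables are made up for illustration, the container name is the one from the question, and the document IDs you pass in are your own. It leaves out `-d` on purpose so that each run stays in the foreground, finishes before the next one starts, and shows its output.

```shell
#!/bin/sh
# Sketch: re-run OCR for several Paperless-ngx documents, one ID at a time.
# CONTAINER defaults to the container name from the question; change it if yours differs.
CONTAINER="${CONTAINER:-paperless-webserver-1}"
# DOCKER is an illustrative hook: set DOCKER=echo to print the commands instead of running them.
DOCKER="${DOCKER:-docker}"

# redo_ocr ID [ID...]: force OCR for each given document ID, sequentially.
redo_ocr() {
    for id in "$@"; do
        "$DOCKER" exec -e "PAPERLESS_OCR_MODE=force" "$CONTAINER" \
            document_archiver --overwrite --document "$id"
    done
}
```

After sourcing the file, a call such as `redo_ocr 12 15 42` (hypothetical IDs) would process those three documents in order. Running it once with `DOCKER=echo` set is a cheap way to check the generated commands before touching any documents.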