python apache-tika tika-server tika-python

How to deal with a large PDF?


I'm trying to extract text from a large PDF with the code below, but the request always hits the timeout. The file comes from a blob on Azure; it is 7.3 MB and has 140 pages, all of which are images.

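import os

# Set the endpoint before importing tika, which reads this variable at import time.
os.environ['TIKA_SERVER_ENDPOINT'] = 'http://0.0.0.0:9998/'

from tika import parser

headers = {
    "X-Tika-OCRLanguage": "eng+nor",
    "X-Tika-PDFextractInlineImages": "true",  # run OCR against inline images
}

# `buffer` is the download stream for the Azure blob (not shown here).
data = parser.from_buffer(
    buffer.readall(),
    xmlContent=True,
    requestOptions={
        "headers": headers,
        "timeout": 3600,  # seconds
    },
)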

Is there a header I'm missing for handling large files?

I'm running tika-server directly from the Docker image with this command:

docker run -d -p 9998:9998 apache/tika:1.28.2-full

Thanks for your time!


Solution

  • I think I've managed to solve the problem. I only needed to change the headers, and for the moment it's working:

    headers = {
        "X-Tika-OCRLanguage": "eng+nor",
        "X-Tika-PDFocrStrategy": "auto"
    }
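
For completeness, here is a minimal sketch of how the changed headers could be plugged back into the call from the question. The helper names `build_ocr_headers` and `extract_xml` are made up for illustration and are not part of tika-python. My reading of why this helps (not confirmed in the thread) is that `X-Tika-PDFextractInlineImages` makes Tika OCR every embedded image separately, while the `auto` strategy lets the PDF parser decide per page whether OCR is needed.

```python
from typing import Dict, Iterable


def build_ocr_headers(languages: Iterable[str], strategy: str = "auto") -> Dict[str, str]:
    """Build the request headers from the solution (hypothetical helper)."""
    return {
        # Several OCR languages are joined with "+", e.g. "eng+nor".
        "X-Tika-OCRLanguage": "+".join(languages),
        "X-Tika-PDFocrStrategy": strategy,
    }


def extract_xml(pdf_bytes: bytes, timeout: int = 3600) -> dict:
    """Send the PDF bytes to the Tika server and return the parsed result.

    Assumes TIKA_SERVER_ENDPOINT is already set and a server is listening.
    """
    # Imported here so build_ocr_headers can be used without tika installed.
    from tika import parser

    return parser.from_buffer(
        pdf_bytes,
        xmlContent=True,
        requestOptions={
            "headers": build_ocr_headers(["eng", "nor"]),
            "timeout": timeout,  # seconds
        },
    )
```

With the blob stream from the question, the call would be `data = extract_xml(buffer.readall())`, and the extracted XHTML should then be under `data["content"]`.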