I am trying to convert a large number of PDFs (10k+) to JPG images and extract text from them. I am currently using the pdf2image
Python library, but it is rather slow. Is there a faster library for this?
from pdf2image import convert_from_bytes
images = convert_from_bytes(open(path,"rb").read())
Note: I am using Ubuntu 18.04
CPU: 4 cores, 8 threads (Ryzen 3 3100)
Memory: 8 GB
pyvips is a bit quicker than pdf2image. I made a tiny benchmark:
#!/usr/bin/python3
import sys
from pdf2image import convert_from_bytes
images = convert_from_bytes(open(sys.argv[1], "rb").read())
for i in range(len(images)):
    images[i].save(f"page-{i}.jpg")
With this test document I see:
$ /usr/bin/time -f %M:%e ./pdf.py nipguide.pdf
1991624:4.80
So 2GB of memory and 4.8s of elapsed time.
You could write this in pyvips as:
#!/usr/bin/python3
import sys
import pyvips
image = pyvips.Image.new_from_file(sys.argv[1])
for i in range(image.get('n-pages')):
    page = pyvips.Image.new_from_file(sys.argv[1], page=i)
    page.write_to_file(f"page-{i}.jpg")
I see:
$ /usr/bin/time -f %M:%e ./vpdf.py nipguide.pdf[dpi=200]
676436:2.57
So 670MB of memory and 2.6s of elapsed time.
They are both using poppler behind the scenes, but pyvips calls directly into the library rather than using processes and temp files, and can overlap load and save.
You can configure pyvips to use pdfium rather than poppler, though it's a bit more work, since pdfium is still not packaged by many distributions. pdfium can be perhaps 3x faster than poppler for some PDFs.
You can use multiprocessing to get a further speedup. This will work better with pyvips because of the lower memory use, and the fact that it's not using huge temp files.
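For example, a minimal multiprocessing sketch (not from the original answer; the dpi value and file naming are just illustrative):

#!/usr/bin/python3
# render each page of one PDF in a separate worker process using a Pool
import sys
from multiprocessing import Pool
import pyvips

filename = sys.argv[1]

def render_page(i):
    page = pyvips.Image.new_from_file(filename, page=i, dpi=200)
    page.write_to_file(f"page-{i}.jpg")

if __name__ == "__main__":
    n_pages = pyvips.Image.new_from_file(filename).get("n-pages")
    with Pool() as pool:
        pool.map(render_page, range(n_pages))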
If I modify the pyvips code to render only a single page, I can use GNU parallel to render each page in a separate process.
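A hypothetical single-page version of the script might look like this (the 1-based page argument handling and naming are my assumption, not shown in the original answer):

#!/usr/bin/python3
# vpdf.py variant: argv[1] is the PDF (optionally with [dpi=...] options),
# argv[2] is the page number supplied by GNU parallel (1-based)
import sys
import pyvips

i = int(sys.argv[2]) - 1
page = pyvips.Image.new_from_file(sys.argv[1], page=i)
page.write_to_file(f"page-{i}.jpg")

Running that under parallel: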
$ time parallel ../vpdf.py us-public-health-and-welfare-code.pdf[dpi=150] ::: {1..100}
real 0m1.846s
user 0m38.200s
sys 0m6.371s
So 100 pages at 150dpi in 1.8s.