Python - Extract a PDF page as a jpeg

How can I efficiently save a particular page of a PDF as a jpeg file using Python?

I have a Python Flask web server where PDFs will be uploaded and I want to also store jpeg files that correspond to each PDF page.

This solution is close but it does not result in the entire page being converted to a jpeg.

Solution

The pdf2image library can be used.

You can install it with pip:

pip install pdf2image

Once installed, you can use the following code to convert every page to an image:

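from pdf2image import convert_from_path

# The second argument is the rendering resolution in DPI; higher values give
# larger, sharper images but take more time and memory.
pages = convert_from_path('pdf_file', dpi=500)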

Then save the pages in JPEG format:

for count, page in enumerate(pages):
    page.save(f'out{count}.jpg', 'JPEG')
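Since the question asks about one particular page, here is a minimal sketch that renders only that page instead of the whole document. It relies on the first_page and last_page arguments of convert_from_path, which are 1-based. The helper name save_page_as_jpeg, the default of 200 DPI, and the optional convert argument (a hook that lets you swap in another converter, e.g. for testing without poppler) are assumptions made for illustration and are not part of pdf2image.

```python
from pathlib import Path


def save_page_as_jpeg(pdf_path, page_number, out_path, dpi=200, convert=None):
    """Render one 1-based page of a PDF and save it as a JPEG file."""
    if page_number < 1:
        raise ValueError("page_number is 1-based and must be >= 1")
    if convert is None:
        # Imported lazily so the helper can be loaded without poppler installed.
        from pdf2image import convert_from_path as convert
    # first_page and last_page restrict rendering to the single requested page.
    pages = convert(pdf_path, dpi=dpi, first_page=page_number, last_page=page_number)
    if not pages:
        raise ValueError(f"page {page_number} not found in {pdf_path}")
    out_path = Path(out_path)
    pages[0].save(out_path, "JPEG")
    return out_path
```

For example, save_page_as_jpeg('pdf_file', 3, 'page3.jpg') should write only the third page. On a web server this avoids holding every rendered page in memory at once.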

Edit: the pdf2image GitHub repo also mentions that it uses pdftoppm, which requires additional installations:

pdftoppm is the piece of software that does the actual magic. It is distributed as part of a greater package called poppler. Windows users will have to install poppler for Windows (see ** below). Mac users will have to install poppler for Mac. Linux users will have pdftoppm pre-installed with the distro (tested on Ubuntu and Archlinux); if it's not, run sudo apt install poppler-utils.

On Windows, you can install the latest version with Anaconda:

conda install -c conda-forge poppler

** Note: Windows 64-bit versions up to 24.08 are available at https://github.com/oschwartz10612/poppler-windows. For 32-bit, however, 22.02 was the last one included in TeXLive 2022 (https://poppler.freedesktop.org/releases.html), so you will not get the latest features or bug fixes.
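Because pdf2image depends on the pdftoppm executable, it can help to check that it is reachable before converting anything. The following is a minimal sketch using only the standard library; the helper name find_pdftoppm is hypothetical. If the check fails on Windows, one option is to pass the folder containing the poppler binaries through the poppler_path argument of convert_from_path.

```python
import shutil


def find_pdftoppm():
    """Return the full path of the pdftoppm executable, or None if it is not on PATH."""
    return shutil.which("pdftoppm")


if __name__ == "__main__":
    location = find_pdftoppm()
    if location is None:
        print("pdftoppm not found on PATH; install poppler or pass poppler_path")
    else:
        print(f"pdftoppm found at {location}")
```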