Tags: python, image, flask, pdf

Python - Extract a PDF page as a jpeg


How can I efficiently save a particular page of a PDF as a jpeg file using Python?

I have a Python Flask web server where PDFs will be uploaded and I want to also store jpeg files that correspond to each PDF page.

This solution is close, but it does not convert the entire page to a JPEG.


Solution

  • The pdf2image library can be used.

    Install it with pip:

    pip install pdf2image
    

    Once it is installed, you can use the following code to convert the PDF pages to images:

    from pdf2image import convert_from_path

    # 'pdf_file' is the path to the PDF; the second argument is the DPI
    pages = convert_from_path('pdf_file', dpi=500)
    

    Save each page in JPEG format:

    for count, page in enumerate(pages):
        page.save(f'out{count}.jpg', 'JPEG')
    

    Edit: the pdf2image GitHub repo also notes that it relies on pdftoppm, which requires a separate installation:

    pdftoppm is the piece of software that does the actual magic. It is distributed as part of a greater package called poppler. Windows users will have to install poppler for Windows (see ** below). Mac users will have to install poppler for Mac. Linux users will have pdftoppm pre-installed with the distro (tested on Ubuntu and Archlinux); if it's not, run sudo apt install poppler-utils.

    On Windows, you can install the latest version with Anaconda:

    conda install -c conda-forge poppler
    

    ** Note: 64-bit Windows builds up to version 24.08 are available at https://github.com/oschwartz10612/poppler-windows. For 32-bit, 22.02 was the last one included in TeXLive 2022 (https://poppler.freedesktop.org/releases.html), so 32-bit users will not get the latest features or bug fixes.
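
Since the question asks for one particular page, converting the whole PDF first can be wasteful. The following is a minimal sketch, not part of the original answer, that uses the first_page and last_page parameters of convert_from_path to render a single page. The function names, the output naming scheme, and the default of 200 DPI are assumptions made for illustration; poppler_path is only needed when poppler is not on the PATH (for example with a manual Windows install).

```python
from pathlib import Path


def page_jpeg_path(pdf_path, page_number, out_dir="."):
    """Hypothetical naming scheme: <pdf stem>_page<N>.jpg inside out_dir."""
    return Path(out_dir) / f"{Path(pdf_path).stem}_page{page_number}.jpg"


def save_page_as_jpeg(pdf_path, page_number, out_dir=".", dpi=200, poppler_path=None):
    """Render one page of a PDF (1-based page number) and save it as a JPEG."""
    if page_number < 1:
        raise ValueError("page_number is 1-based and must be >= 1")

    # Imported lazily so the naming helper works without pdf2image installed.
    from pdf2image import convert_from_path

    # first_page and last_page restrict rendering to the requested page,
    # so the other pages are not rasterized.
    images = convert_from_path(
        pdf_path,
        dpi=dpi,
        first_page=page_number,
        last_page=page_number,
        poppler_path=poppler_path,  # leave as None if poppler is on PATH
    )
    if not images:
        raise ValueError(f"{pdf_path} has no page {page_number}")

    out_path = page_jpeg_path(pdf_path, page_number, out_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    images[0].save(out_path, "JPEG")
    return out_path
```

For example, save_page_as_jpeg('uploads/report.pdf', 3, out_dir='pages') would write the third page to pages/report_page3.jpg. In a Flask upload handler, the same call could be made once per page after the uploaded file has been saved to disk.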