Tags: python, image, pdf

Extract a page from a PDF as a JPEG


In Python, how can I efficiently save a specific page of a PDF as a JPEG file?

Use case: I have a Python Flask web server where PDFs are uploaded and a JPEG of each page is stored.

This solution is close, but the problem is that it does not convert the entire page to JPEG.


Solution

  • The pdf2image library can be used.

    You can install it with:

    pip install pdf2image
    

    Once installed, you can use the following code to convert the pages to images:

    from pdf2image import convert_from_path
    # Returns a list of PIL images, one per page, rendered at 500 DPI
    pages = convert_from_path('pdf_file', dpi=500)
    

    Save the pages in JPEG format:

    for count, page in enumerate(pages):
        page.save(f'out{count}.jpg', 'JPEG')
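
The loop above renders and saves every page, while the question asks for one specific page. A minimal sketch of that case, assuming pdf2image's `first_page` and `last_page` arguments (both 1-based) are used to restrict rendering to a single page. The helper name `save_page_as_jpeg`, its default DPI, and the output naming scheme are hypothetical choices for illustration, not part of the library.

```python
from pathlib import Path


def save_page_as_jpeg(pdf_path, page_number, out_dir=".", dpi=200):
    """Render one 1-based page of pdf_path and save it as a JPEG.

    Hypothetical helper; returns the path of the written file.
    """
    if page_number < 1:
        raise ValueError("page_number is 1-based and must be >= 1")

    # Imported here so this module can still be loaded on machines
    # where pdf2image (and poppler) are not installed yet.
    from pdf2image import convert_from_path

    # first_page and last_page limit rendering to the requested page,
    # so the other pages of the PDF are never rasterized.
    images = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    if not images:
        raise ValueError(f"{pdf_path} has no page {page_number}")

    # Assumed naming scheme: <pdf name>_page<N>.jpg inside out_dir.
    out_path = Path(out_dir) / f"{Path(pdf_path).stem}_page{page_number}.jpg"
    images[0].save(out_path, "JPEG")
    return out_path
```

For example, `save_page_as_jpeg('pdf_file', 3)` should write `pdf_file_page3.jpg` to the current directory. Rendering only the requested page avoids holding every page of a large upload in memory at once.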
    

    Edit: the pdf2image GitHub repository also notes that it relies on pdftoppm, which must be installed separately:

    pdftoppm is the piece of software that does the actual magic. It is distributed as part of a greater package called poppler. Windows users will have to install poppler for Windows. Mac users will have to install poppler for Mac. Linux users will have pdftoppm pre-installed with the distro (Tested on Ubuntu and Archlinux) if it's not, run sudo apt install poppler-utils.

    On Windows, you can install the latest version with Anaconda:

    conda install -c conda-forge poppler
    

    Note: Windows builds up to version 0.67 are available at http://blog.alivate.com.au/poppler-windows/, but 0.68 was released in August 2018, so these builds lack the latest features and bug fixes.
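
Because pdf2image depends on the external pdftoppm program, it can help to confirm that poppler is reachable before handling uploads. A minimal sketch using only the standard library, assuming pdftoppm is expected on the `PATH`. The function name `poppler_available` is hypothetical.

```python
import shutil


def poppler_available():
    """Return True if the pdftoppm executable can be found on PATH."""
    return shutil.which("pdftoppm") is not None


if __name__ == "__main__":
    if poppler_available():
        print("pdftoppm found at:", shutil.which("pdftoppm"))
    else:
        print("pdftoppm not found; install poppler before using pdf2image")
```

If poppler is installed somewhere that is not on the `PATH`, `convert_from_path` also accepts a `poppler_path` argument pointing at the directory that contains the poppler binaries.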