extracting images from PDF with page and screen coordinate information

I want to extract images from PDFs retaining a knowledge of their content (page_number and coordinates on page). (Some tools (e.g. pdfminer) only emit image files with non-semantic names, e.g. Img0.bmp). I can do this with PDFBox (Java) but I'd ideally like a Python tool

My current (arbitrary) designs is to create filenames of the form:

image_<page>_<serial_in_page>_<x1>_<x2>__<y1>_<y2>.png

Currently pdfplumber exposes cooordinates but with a PDFStream and encoding information rather than an image. Code to convert the stream to a *.png would solve the problem.

(NOTE: the pdfplumber approach of rendering to the screen and capturing the known rectangle (which I use) is not a solution as the image is often degraded and frequently overwritten with text.)

(NOTE: I have had problems with several Python tools (pdfminer.six, PuMuPDF) extracting images as they make the background black which obscures black text, etc. PDFBox (Java) doesn't have this problem.)

Solution

Python tools are likely to have similar problems to any tools even those that require a single line to manipulate images or extract their details.

Here we can see a visual layout of all the compressed images in the file by using one command line to extract images. Here the individual object references have been converted into normal tiff or jpg (other tools may use pbm and pgm especially for OCR but the result is generally similar). The Greyscale Alpha softmask (B&W) transparency components are not necessarily tied direct to a page or an image other than by internal references, and usually appear like negatives.

What you may note is that the objects that were inserted most likely as one PNG are broken in two when injected into the PDF and their scaled placement is defined. Note that a raw PNG (whatever its source common resolution was) will retain number of dots but its scale when inserted into the PDF could be totally different horizontal and vertical, thus the only meaningful data is W x H in pixel values.

It is not trivial to overlay the mask on the RGB component when simply extracted but can allow for colour changes if desired.

So PDFbox is one of the simpler/better tools for blending to a suitable output, (as you have discovered) but for Python it is generally the top end library products that can identify the placement of the two images and combine into a suitable alpha output like a new PNG.

For many suggestions see Extract images from PDF without resampling, in python?.

Your related part question was knowing where those components are placed on each page since one image (and its alpha mask) could be placed multiple times such as a heading logo on each page. Again it is easy in a single command line to see which pages are referenced by a group of images, but to see which image is placed where requires analyzing each pages resources, again requiring a library interrogation of page contents, thus best done via power house libraries such as iText or any other like PDFtron for python.

For a related command in PyMuPDF see https://pymupdf.readthedocs.io/en/latest/page.html#Page.get_image_rects