image pdf pdf-generation ghostscript postscript

Why are images in pdf sometimes sliced into multiple images?


I noticed that images are sometimes sliced up in PDFs.

Steps:

  • insert a high-resolution image (3000x1800) into a .docx
  • use Word's "Microsoft Print to PDF" option to convert it to PDF
  • extract all images with pdfimages or PyMuPDF

Result:

  • Image is sliced horizontally into three images

Questions:

  • What exactly happens in the conversion from .docx to PDF (or in PDF generation in general) that makes the converter slice the image into three images instead of keeping one?
  • Do the individual XObjects of the sliced images contain any information saying that the three images originally belonged to one?
  • How do I know how the images are sliced (horizontally or vertically)? And if two images were originally inserted into the .docx file and both are sliced, can you tell whether slice x belongs to original image y or z?

Solution

  • So, as you have found out: because the code which generates the PDF chose to do so.

    The technical reasons may vary. It could be that, historically, some printers had only so much memory and needed to receive size-limited images when printing, and that whoever wrote the PDF export code in Microsoft Office chose to apply that limit.

    Anyway, technically, as put in the comments, an image in a PDF file could be composed of any number of smaller images tiled together.

    Now, the second part, and your actual question: to know whether images in a PDF file belong together in a single original image, one would need a custom extractor tool that checks the geometry of all images in the document and finds out which images sit next to each other with no margin or gap between them. That would not be hard to do for well-behaved files (we can't know whether files generated by MS Office are: there are ways to obfuscate image positioning by specifying it indirectly).

    The metadata in the image parts may or may not contain information that would allow one to recompose the original image: it is up to the code generating the PDF to include this metadata or not. The geometry can't lie in this case, though: if the final document visually presents a single image, it is possible to detect that when fetching the images.
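As a rough illustration of the geometry check described above, here is a minimal sketch in plain Python. It assumes you have already collected the on-page bounding box of every image placement as an `(x0, y0, x1, y1)` tuple in points. The function names (`touches`, `group_slices`, `slice_orientation`) and the 1-point tolerance are made up for this example, and this is not what any particular extractor does internally.

```python
from itertools import combinations


def touches(a, b, tol=1.0):
    """True if boxes a and b (x0, y0, x1, y1) share an edge within tol points."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Length of the overlap of the two boxes projected on each axis
    # (negative means there is a gap on that axis).
    x_overlap = min(ax1, bx1) - max(ax0, bx0)
    y_overlap = min(ay1, by1) - max(ay0, by0)
    stacked = x_overlap > tol and abs(y_overlap) <= tol       # one above the other
    side_by_side = y_overlap > tol and abs(x_overlap) <= tol  # left / right
    return stacked or side_by_side


def group_slices(boxes, tol=1.0):
    """Group indices of boxes that are connected through shared edges."""
    parent = list(range(len(boxes)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(boxes)), 2):
        if touches(boxes[i], boxes[j], tol):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    # Order slices inside a group by position (y0, then x0), groups by first index.
    ordered = [sorted(g, key=lambda i: (boxes[i][1], boxes[i][0]))
               for g in groups.values()]
    return sorted(ordered, key=min)


def slice_orientation(boxes, group):
    """Crude guess: strips stacked along y were cut by horizontal lines."""
    if len(group) < 2:
        return "single"
    x_spread = max(boxes[i][0] for i in group) - min(boxes[i][0] for i in group)
    y_spread = max(boxes[i][1] for i in group) - min(boxes[i][1] for i in group)
    return "horizontal strips" if y_spread > x_spread else "vertical strips"


# Hypothetical placements: one picture cut into three strips, another into two.
boxes = [
    (72, 72, 372, 132),
    (72, 132, 372, 192),
    (72, 192, 372, 252),
    (72, 300, 372, 360),
    (72, 360, 372, 420),
]
for group in group_slices(boxes):
    print(group, slice_orientation(boxes, group))
```

The boxes would have to come from whatever tool you use to read the PDF. With PyMuPDF, `page.get_image_info()` should report a `bbox` for each placed image, but check that against your installed version. Note the limits of this heuristic: two genuinely separate pictures placed flush against each other would be merged into one group, and a picture tiled as a grid would need a better orientation test than the one above.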