I am trying to extract all the images from this PDF file: https://s3.us-west-2.amazonaws.com/secure.notion-static.com/566ca0ca-393d-47d4-b3fc-eb3632777bf8/example.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAT73L2G45O3KS52Y5%2F20210610%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20210610T041944Z&X-Amz-Expires=86400&X-Amz-Signature=2f8a2d08647e4953448f890adb56d11b1d01e21b941ca3dc9f9b5ab3caa7f018&X-Amz-SignedHeaders=host&response-content-disposition=filename%20%3D%22example.pdf%22
using the fitz module (PyMuPDF). The following code extracts every image, including small icons. I need to skip those icons and extract only the actual images.
import fitz  # PyMuPDF

pdf = fitz.open("example.pdf")
for pic in range(len(pdf)):
    image_list = pdf.getPageImageList(pic)
    j = 1
    for image in image_list:
        xref = image[0]  # cross-reference number of the image object
        pix = fitz.Pixmap(pdf, xref)
        if pix.n < 5:  # gray or RGB: can be written as PNG directly
            pix.writePNG(f'{pic}_{j}.png')
        else:  # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG(f'{xref}_{pic}.png')
            pix1 = None
        pix = None
        j = j + 1
    print(f'Total images on page {pic} are {len(image_list)}')
get_page_images() (named getPageImageList() in older PyMuPDF releases, as in the question's code) returns a list of all images (directly or indirectly) referenced by the page.
>>> doc = fitz.open("pymupdf.pdf")
>>> imglist = doc.getPageImageList(0)
>>> for img in imglist: print(img)
(241, 0, 1043, 457, 8, 'DeviceRGB', '', 'Im1')
In the above example, doc.getPageImageList(0) returns a list of the images shown on the page. Each entry looks like (xref, smask, width, height, bpc, colorspace, alt. colorspace, name).
So the values 1043 and 457 are the width and height of the image. You can add an if condition on them to skip small images and icons.
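The size check described above could look roughly like the sketch below. The helper names is_large_enough and extract_large_images and the 100-pixel thresholds are assumptions made up for illustration, so tune them to your PDF. The sketch uses the snake_case method names of newer PyMuPDF releases (get_page_images, Pixmap.save); on older releases the equivalents are getPageImageList and writePNG, as in the question.

```python
# Assumed thresholds in pixels; anything smaller is treated as an icon.
MIN_WIDTH = 100
MIN_HEIGHT = 100


def is_large_enough(img, min_width=MIN_WIDTH, min_height=MIN_HEIGHT):
    """Return True if an image-list entry is at least the minimum size.

    img is one entry of the page image list:
    (xref, smask, width, height, bpc, colorspace, alt. colorspace, name, ...)
    """
    width, height = img[2], img[3]
    return width >= min_width and height >= min_height


def extract_large_images(pdf_path, min_width=MIN_WIDTH, min_height=MIN_HEIGHT):
    """Save every sufficiently large image as PNG and return the file names."""
    import fitz  # PyMuPDF; imported here so the size filter works without it

    saved = []
    doc = fitz.open(pdf_path)
    for page_no in range(len(doc)):
        for j, img in enumerate(doc.get_page_images(page_no), start=1):
            if not is_large_enough(img, min_width, min_height):
                continue  # skip small icons
            pix = fitz.Pixmap(doc, img[0])  # img[0] is the xref
            if pix.n - pix.alpha >= 4:  # CMYK: convert to RGB before saving
                pix = fitz.Pixmap(fitz.csRGB, pix)
            name = f"{page_no}_{j}.png"
            pix.save(name)
            saved.append(name)
    return saved
```

With these thresholds, the sample entry from the docs (1043 x 457) passes the filter, while a 16 x 16 icon would be skipped. Calling extract_large_images("example.pdf") would then write only the larger images to the current directory.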
More information is available in the PyMuPDF documentation for this method.