I have an existing function that uses pdf2image to convert each page of a PDF into an image. For a variety of reasons, I can no longer use pdf2image and must use PyMuPDF instead. However, I am having trouble getting the same results as I did with pdf2image.
The code for pdf2image and PyMuPDF is below.
Each item in pages_list for pdf2image is a numpy.ndarray, and I can verify that the PDFs were converted properly by viewing Image.fromarray(pages_list[i]) with the PIL library. When I do this with the pdf2image result, I see my original PDF page as an image. When I do this with the PyMuPDF result, I see one long, very thin column of pixels that does not form a full image.
ETA: I can use Pillow locally to review the images, but this will eventually run in an AWS Lambda function, where I am not allowed to use Pillow or save files.
pdf2image
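from pdf2image import convert_from_path
import numpy as np

pages = convert_from_path(img_path, 500)  # render every page at 500 dpi
pages_list = []
for i in range(len(pages)):
    pages_list.append(np.array(pages[i]))  # PIL image -> (height, width, 3) array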
PyMuPDF
import fitz  # PyMuPDF
import numpy as np

pdf_doc = fitz.open(img_path)
pages_list = []
for i in range(len(pdf_doc)):
    page = pdf_doc[i]
    pixmap = page.get_pixmap(dpi=300)
    img = pixmap.tobytes()
    img_array = np.frombuffer(bytearray(img), dtype=np.uint8)
    img_array_np = np.array(img_array)
    pages_list.append(img_array_np)
While I did successfully convert the resulting bytes object into a numpy array, the array looks very different from the results of pdf2image. I was hoping to get an identical result from PyMuPDF as I did from pdf2image, but I am not sure exactly where I'm going wrong. I imagine it's something in the way I'm converting from bytes to a numpy array, but I have yet to find a working fix.
# Repeated for pdf2image and PyMuPDF
print(f"{library_name}: \n{type(pages_list[0])}")
print(f".shape: {pages_list[0].shape}")
print(f".ndim: {pages_list[0].ndim}")
print(f".size: {pages_list[0].size}")
# pdf2image:
# <class 'numpy.ndarray'>
# .shape: (5500, 4250, 3)
# .ndim: 3
# .size: 70125000
# PyMuPDF:
# <class 'numpy.ndarray'>
# .shape: (378861,)
# .ndim: 1
# .size: 378861
How can I get the same results from PyMuPDF as I did from pdf2image?
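A likely explanation for the 1-D array: Pixmap.tobytes() returns an encoded image file (PNG by default), not raw pixel values, so 378861 would be the size of a compressed PNG in bytes rather than a pixel count. The sketch below is one way to check this without Pillow and without saving files. The helper names looks_like_png and raw_rgb_size are hypothetical, made up for illustration.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the first 8 bytes of every PNG file


def looks_like_png(data: bytes) -> bool:
    """Return True if the buffer starts with the PNG file signature."""
    return bytes(data[:8]) == PNG_SIGNATURE


def raw_rgb_size(height: int, width: int, channels: int = 3) -> int:
    """Bytes occupied by an uncompressed 8-bit image of the given shape."""
    return height * width * channels


# The pdf2image array from the question holds one byte per channel per pixel.
print(raw_rgb_size(5500, 4250))  # 70125000, matching the .size reported above
```

If looks_like_png(pixmap.tobytes()) returns True, the buffer is a PNG file and cannot be reshaped into (height, width, channels) directly. The uncompressed pixel data is available separately as pixmap.samples.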
The code block below solved my issue, courtesy of this question and a twist on the code commented by Jorj McKie.
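import fitz  # PyMuPDF
import numpy as np

doc = fitz.open(img_path)
pages_list = []
for page in doc:
    # A zoom of 2.0 renders at 144 dpi (PyMuPDF's base resolution is 72 dpi)
    zoom_x, zoom_y = 2.0, 2.0
    mat = fitz.Matrix(zoom_x, zoom_y)
    pix = page.get_pixmap(matrix=mat)
    # pix.samples holds the raw pixel bytes; reshape to (height, width, channels)
    im = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.h, pix.w, pix.n)
    # Reorder channels from RGB to BGR; omit this line to keep RGB order
    im = np.ascontiguousarray(im[..., [2, 1, 0]])
    pages_list.append(im)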
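One caveat on resolution: a zoom of 2.0 will not reproduce the (5500, 4250, 3) shape that pdf2image gave at 500 dpi. Assuming the page is US Letter (612 x 792 points, which is what that shape implies but which I have not confirmed), the arithmetic can be sketched as below. zoom_for_dpi and expected_shape are hypothetical helpers for illustration, and the sizes PyMuPDF actually produces may differ by a pixel because of rounding.

```python
POINTS_PER_INCH = 72  # PDF page sizes are measured in points, 72 per inch


def zoom_for_dpi(dpi: float) -> float:
    """Zoom factor that makes a 72 dpi base rendering come out at `dpi`."""
    return dpi / POINTS_PER_INCH


def expected_shape(width_pt: float, height_pt: float, dpi: float, channels: int = 3):
    """Approximate (height, width, channels) of a page rendered at `dpi`."""
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    return (height_px, width_px, channels)


print(zoom_for_dpi(144))              # 2.0, the zoom used above
print(expected_shape(612, 792, 144))  # (1584, 1224, 3)
print(expected_shape(612, 792, 500))  # (5500, 4250, 3), the pdf2image shape
```

To match the pdf2image dimensions, the same loop could use fitz.Matrix(500 / 72, 500 / 72), or page.get_pixmap(dpi=500) as in the question. Dropping the channel reordering line should also keep the RGB order that pdf2image returns.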