
How do I convert a page rendered by PyMuPDF into a numpy array?


I have an existing function that uses pdf2image to convert each page of a PDF into an image. For a variety of reasons, I can no longer use pdf2image and must use PyMuPDF instead, but I am having trouble getting the same results that I got from pdf2image.

The code for each library is below.

Each item in pages_list from pdf2image is a numpy.ndarray, and I can verify that the PDF was converted properly by viewing Image.fromarray(pages_list[i]) with the PIL library. With the pdf2image result, I see my original PDF page as an image. With the PyMuPDF result, I see one long, very thin column of pixels that does not form a complete image.

ETA: I can use Pillow locally to review the images, but this will eventually go into an AWS Lambda function, where I am not allowed to use Pillow or save files.

pdf2image

import numpy as np
from pdf2image import convert_from_path

# Render every page at 500 DPI; each page comes back as a PIL image
pages = convert_from_path(img_path, 500)
pages_list = [np.array(page) for page in pages]

PyMuPDF

import fitz  # PyMuPDF
import numpy as np

pdf_doc = fitz.open(img_path)
pages_list = []
for i in range(len(pdf_doc)):
    page = pdf_doc[i]
    pixmap = page.get_pixmap(dpi=300)
    img = pixmap.tobytes()
    img_array = np.frombuffer(bytearray(img), dtype=np.uint8)
    img_array_np = np.array(img_array)
    pages_list.append(img_array_np)

I did manage to convert the resulting bytes object into a numpy array, but the array looks very different from the pdf2image result. I was hoping to get an identical result from PyMuPDF, and I am not sure where I am going wrong. I suspect the problem is in how I convert the bytes to a numpy array, but I have yet to find a working fix.

# Repeated for pdf2image and PyMuPDF
print(f"{library_name}: \n{type(pages_list[0])}")
print(f".shape: {pages_list[0].shape}")
print(f".ndim: {pages_list[0].ndim}")
print(f".size: {pages_list[0].size}")

# pdf2image: 
# <class 'numpy.ndarray'>
# .shape: (5500, 4250, 3)
# .ndim: 3
# .size: 70125000

# PyMuPDF: 
# <class 'numpy.ndarray'>
# .shape: (378861,)
# .ndim: 1
# .size: 378861

How can I get the same results from PyMuPDF as I did from pdf2image?


Solution

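  • The code block below solved my issue. It is adapted from this question, with a twist on the code posted in a comment by Jorj McKie. The key difference from my attempt is that pixmap.tobytes() returns an encoded image file (PNG by default), whereas pix.samples exposes the raw pixel bytes, which can be reshaped to (height, width, channels).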

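    import fitz  # PyMuPDF
    import numpy as np

    doc = fitz.open(img_path)
    pages_list = []
    # A 2x zoom renders at 144 DPI, since the PDF base resolution is 72 DPI
    mat = fitz.Matrix(2.0, 2.0)
    for page in doc:
        pix = page.get_pixmap(matrix=mat)
        # pix.samples holds the raw pixel bytes, so it can be reshaped directly
        im = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.h, pix.w, pix.n)
        # Reverse the channel order (RGB to BGR) and make the result contiguous
        im = np.ascontiguousarray(im[..., [2, 1, 0]])
        pages_list.append(im)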
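
    If the goal is to match the pdf2image output as closely as possible, a minimal sketch might look like the following. It assumes the pdf2image arrays are in RGB order (they come from PIL images), so it skips the channel swap above, which reverses RGB to BGR, and it renders at 500 DPI to match the pdf2image call in the question. The names samples_to_array and pdf_to_rgb_arrays are hypothetical helpers, not part of either library. Pixel values will probably not be bit-identical, since pdf2image renders with poppler and PyMuPDF renders with MuPDF.

```python
import numpy as np


def samples_to_array(samples, height, width, channels):
    """Reshape raw interleaved pixel bytes into a (height, width, channels) array.

    Hypothetical helper: it only wraps np.frombuffer plus a size check.
    """
    expected = height * width * channels
    if len(samples) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(samples)}")
    # frombuffer returns a read-only view over the bytes; copy() makes it writable
    flat = np.frombuffer(samples, dtype=np.uint8)
    return flat.reshape(height, width, channels).copy()


def pdf_to_rgb_arrays(pdf_path, dpi=500):
    """Render each page with PyMuPDF and return a list of RGB uint8 arrays.

    Hypothetical helper; dpi=500 is assumed only to mirror the question.
    """
    import fitz  # PyMuPDF; imported here so samples_to_array works without it

    arrays = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # alpha=False keeps three channels; the default colorspace is RGB
            pix = page.get_pixmap(dpi=dpi, alpha=False)
            arrays.append(samples_to_array(pix.samples, pix.h, pix.w, pix.n))
    return arrays


if __name__ == "__main__":
    # Synthetic 2x3 RGB image, so the reshape can be checked without a PDF
    raw = bytes(range(18))
    demo = samples_to_array(raw, 2, 3, 3)
    print(demo.shape)
    print(demo[0, 1])
```

    Calling pdf_to_rgb_arrays(img_path) should then give one array per page, with a shape close to the (5500, 4250, 3) reported for pdf2image on the same page. The size check in samples_to_array would also have flagged my original attempt, because the PNG bytes from tobytes() are far shorter than height * width * channels.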