Tags: opencv, tesseract, python-tesseract, dpi, pymupdf

How to get dpi of an image cropped with Python? Tesseract --dpi parameter


My code opens a pdf, converts the first page to an image, then cuts rectangles out of this image by coordinates and extracts text from each cropped rectangle using Tesseract.

I discovered that for some of the larger images, OCR performs much worse than for others.

After playing around with Tesseract on the command line, I also discovered that for some images Tesseract estimates the resolution itself, which affects the result.

I also played around with the --dpi parameter. For some images the best results were obtained with --dpi 1800, for some with --dpi 300. I'm looking for a way to set the dpi for my images before extracting text or a way to find the dpi of my images.

I also tried pix.set_dpi() and get_pixmap(dpi = ..), but neither improved anything. I would be thankful for any suggestions.

Here is the code I use:

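    import cv2 as cv
    import fitz  # PyMuPDF
    import numpy as np
    import pytesseract

    doc = fitz.open("input.pdf")  # placeholder path; the original snippet assumes doc is already open
    page = doc.load_page(0)
    page_size = page.rect
    zoom = 3  # render scale relative to the PDF's native 72 dpi
    mat = fitz.Matrix(zoom, zoom)
    pix = page.get_pixmap(matrix=mat)  # RGB without alpha by default
    # View the raw sample bytes as a (height, width, channels) array
    img_data = pix.samples
    img_array = np.frombuffer(img_data, dtype=np.uint8)
    img_array = img_array.reshape(pix.height, pix.width, pix.n)
    img = cv.cvtColor(img_array, cv.COLOR_RGB2BGR)

    #...
    k = 0
    result_dict = {}
    for i, rect in enumerate(rectangles):
        x1, y1, x2, y2 = rect  # pixel coordinates in the zoomed image
        roi = img[y1:y2, x1:x2]
        k += 1
        text = pytesseract.image_to_string(roi, lang="eng+deu")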


Solution

  • To OCR only a region of a PDF page, render just that region with PyMuPDF and let it run Tesseract on the result:

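    import fitz  # PyMuPDF

    doc = fitz.open("input.pdf")
    page = doc[pno]  # pno: 0-based page number
    # An area on the page, in page coordinates (points), not image pixels
    rect = fitz.Rect(x0, y0, x1, y1)
    pix = page.get_pixmap(clip=rect, dpi=150)  # render only that area at 150 dpi

    # make a 1-page temp PDF from the area and OCR it
    # (uses PyMuPDF's Tesseract integration; the language data must be installed)
    ocr = fitz.open("pdf", pix.pdfocr_tobytes())  # 1-page temp PDF
    ocrpage = ocr[0]
    text = ocrpage.get_text()  # OCRed text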
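
If you would rather keep the OpenCV/pytesseract pipeline from the question, it may help to know what resolution the image actually has. PDF page coordinates are in points (1/72 inch), so a pixmap rendered with fitz.Matrix(zoom, zoom) has 72 * zoom pixels per inch, and cropping it with NumPy slicing does not change that. The sketch below is only an illustration: the helper names are made up for this example, and the rectangle conversion assumes an unrotated page whose top-left corner is at (0, 0).

```python
# PDF user space is measured in points: 1 point = 1/72 inch.
POINTS_PER_INCH = 72


def zoom_to_dpi(zoom):
    """Resolution of a pixmap rendered with fitz.Matrix(zoom, zoom)."""
    return round(POINTS_PER_INCH * zoom)


def dpi_to_zoom(dpi):
    """Zoom factor that renders a page at the requested resolution."""
    return dpi / POINTS_PER_INCH


def tesseract_dpi_config(zoom):
    """Config string telling Tesseract the real resolution of the render."""
    return f"--dpi {zoom_to_dpi(zoom)}"


def pixel_rect_to_page_rect(rect, zoom):
    """Convert pixel coordinates in the zoomed image back to page points.

    Assumes an unrotated page with its origin at the top-left corner.
    """
    x1, y1, x2, y2 = rect
    return (x1 / zoom, y1 / zoom, x2 / zoom, y2 / zoom)


print(zoom_to_dpi(3))            # resolution of the question's zoom=3 render
print(tesseract_dpi_config(3))   # string to hand to Tesseract
print(pixel_rect_to_page_rect((300, 600, 900, 750), 3))
```

With the question's zoom = 3 the image is 216 dpi, so the call could become pytesseract.image_to_string(roi, lang="eng+deu", config=tesseract_dpi_config(zoom)). The converted rectangle is what you would pass to fitz.Rect(...) for the clip= approach above. This only tells Tesseract the true resolution; whether it improves recognition for a given page still has to be tested.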