Tags: python, pdf, python-tesseract, pdfplumber

Converting pytesseract.Output.DATAFRAME into bytes or an OCR'ed PDF


Is it possible to write to a pdf file retroactively using pytesseract.image_to_data() output?

For my OCR pipeline, I need granular access to the OCR data for each page of my PDF. I get it with this call:

ocr_dataframe = pytesseract.image_to_data(
    tesseract_image,
    output_type=pytesseract.Output.DATAFRAME,
    config=PYTESSERACT_CUSTOM_CONFIG,
)

Now I want to extract some tabular data from the PDF using pdfplumber. However, pdfplumber only accepts one of three inputs:

  • path to your PDF file
  • file object, loaded as bytes
  • file-like object, loaded as bytes

I am aware that I can use pytesseract to convert my original pdf to a searchable one (in bytes representation) using the following method:

# Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
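For reference, the bytes returned by that call can be handed to pdfplumber without touching the disk by wrapping them in a file-like object. The following is only a minimal sketch: the helper names as_file_like and extract_tables_from_pdf_bytes are made up for illustration, and it assumes pdfplumber is installed.

```python
import io


def as_file_like(pdf_bytes: bytes) -> io.BytesIO:
    """Wrap raw PDF bytes in a seekable, file-like object."""
    return io.BytesIO(pdf_bytes)


def extract_tables_from_pdf_bytes(pdf_bytes: bytes) -> list:
    """Open in-memory PDF bytes with pdfplumber and collect the tables of each page."""
    # Imported lazily so that as_file_like() has no third-party dependency.
    import pdfplumber

    with pdfplumber.open(as_file_like(pdf_bytes)) as pdf:
        # One list of tables per page, in page order.
        return [page.extract_tables() for page in pdf.pages]
```

Usage would look roughly like extract_tables_from_pdf_bytes(pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')).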

However, I would like to avoid ocr'ing my pdfs twice. Is it possible to combine the output from pytesseract.image_to_data() with the original image and create some kind of bytes representation?

Any help would be much appreciated!


Solution

  • Okay, so I am pretty sure that what I was trying to do is impossible.

    By nature, pytesseract.Output.DATAFRAME produces a pandas DataFrame. Nowhere in that data structure is the original image. The output is just rows and columns: the recognized text of each word, its bounding-box coordinates, and a confidence score. No pixels, so there is nothing to build a PDF page from.
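To make concrete what the image_to_data() output does contain, here is a small sketch that rebuilds text lines from its columns (page_num, block_num, par_num, line_num, word_num, text). It works on a plain dict of lists, which is what pytesseract.Output.DICT returns and what ocr_dataframe.to_dict("list") would give for the DataFrame. The function name words_to_lines is hypothetical.

```python
from collections import defaultdict


def words_to_lines(ocr: dict) -> list:
    """Rebuild text lines from image_to_data-style columns (dict of equal-length lists)."""
    lines = defaultdict(list)
    for i, text in enumerate(ocr["text"]):
        # Structural rows (page, block, paragraph, line) carry no text:
        # None, or NaN when the data came from a DataFrame (NaN != NaN).
        if text is None or text != text:
            continue
        word = str(text).strip()
        if not word:
            continue
        key = (
            ocr["page_num"][i],
            ocr["block_num"][i],
            ocr["par_num"][i],
            ocr["line_num"][i],
        )
        lines[key].append((ocr["word_num"][i], word))
    # Order lines by their position keys and words by word_num within a line.
    return [" ".join(w for _, w in sorted(lines[key])) for key in sorted(lines)]
```

Everything recovered this way is text and layout metadata; the page image itself still has to come from somewhere else.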

    Instead, I created a class that could hold the original image and the ocr output dataframe at the same time. Here is what the instance initialization looks like:

    import pathlib

    import cv2
    import pytesseract

    # ocr_preprocess and PYTESSERACT_CUSTOM_CONFIG are defined elsewhere in my project.

    class OcrPage:  # class name is illustrative
        def __init__(self, temp_image_path):
            self.image_path = pathlib.Path(temp_image_path)
            # Keep the original pixels alongside the OCR output
            self.image = cv2.imread(str(self.image_path), cv2.IMREAD_GRAYSCALE)
            self.ocr_dataframe = self.ocr()

        def ocr(self):
            # Preprocess the image in preparation for pytesseract OCR
            tesseract_image = ocr_preprocess(self.image)

            # OCR the image using pytesseract
            ocr_dataframe = pytesseract.image_to_data(
                tesseract_image,
                output_type=pytesseract.Output.DATAFRAME,
                config=PYTESSERACT_CUSTOM_CONFIG,
            )
            return ocr_dataframe


    This may be a little memory intensive, since every instance keeps its full image in memory, but I want to avoid having to write many intermediate images to disk.
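If memory does become a problem, one possible variation is to store only the path and run OCR lazily, caching the result so it is computed at most once per instance. This is a sketch of that idea and not the class described above: LazyOcrPage, load_image and run_ocr are hypothetical names, and the two callables stand in for wrappers around cv2.imread and pytesseract.image_to_data.

```python
from functools import cached_property
from typing import Any, Callable


class LazyOcrPage:
    """Keep only the image path; load pixels and run OCR on first access."""

    def __init__(
        self,
        image_path: str,
        load_image: Callable[[str], Any],
        run_ocr: Callable[[Any], Any],
    ):
        self.image_path = image_path
        self._load_image = load_image  # e.g. a wrapper around cv2.imread
        self._run_ocr = run_ocr        # e.g. a wrapper around pytesseract.image_to_data

    @cached_property
    def ocr_dataframe(self):
        # Runs on first access only; the result is stored on the instance.
        # The decoded image is not kept after OCR finishes.
        return self._run_ocr(self._load_image(self.image_path))
```

The trade-off is that the image has to be re-read from disk whenever the pixels are needed again, so this only helps when the OCR output is what gets used most.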