
Read single-page .tif files as a multipage .tiff based on filename


UPDATE: I found out that it is unreasonable to create PDF files from the OCRed files, so it is better to leave them as they are, without conversion. I still have the problem that some images belong together as pages of one document, while others are one-pagers.

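import glob

import pandas as pd
import pytesseract
from PIL import Image

data = []
listOfPages = glob.glob(r"C:/Users/name/test/*.tif")
for entry in listOfPages:
    # Tesseract language codes are three letters: "eng", not "en"
    text = pytesseract.image_to_string(Image.open(entry), lang="eng")
    data.append(text)
df0 = pd.DataFrame(data, columns=['raw_text'])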

This creates a pandas DataFrame in which each row holds the OCR text of one .tif file, i.e. a single page. How can I concatenate the .tif files that belong together (see the original question) in order to get the full multipage string?

Original question: I want to convert the single-page .tif files in my_folder to multipage .pdf files in pdf_folder. TIFFs that have no subsequent pages should also be converted, to single-page PDFs. Ultimately, I want a text PDF created by OCR-ing multiple image-based TIFF files.

I infer which .tif files belong together from the filename pattern:

Drs_1_00109_1_ADS.tif
Drs_1_00099_1_ADS_000.tif
Drs_1_00099_1_ADS_001.tif
Drs_1_00099_1_ADS_002.tif
Drs_1_00186_1_ADS.tif
Drs_1_00192_1_ADS_000.tif
Drs_1_00192_1_ADS_001.tif

For example, Drs_1_00192_1_ADS_000.tif and Drs_1_00192_1_ADS_001.tif (two single-page images) should be converted to the two-page Drs_1_00192_1_ADS.pdf, which contains the text data of both images. The code works for single-page PDF creation. How can I make it work for the multipage filename pattern?

Thanks!


Solution

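  • I would do that by globbing for all files ending in _000.tif, which presumably are the starting points of multi-page documents, then appending the files that result from incrementing the suffix until a file is missing. Note that single-page files without a numeric suffix, such as Drs_1_00109_1_ADS.tif, are not matched by this glob and need a separate pass.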

    #!/usr/bin/env python3
    
    import os
    from PIL import Image
    from glob import glob
    
    # Iterate over all files ending in '000.tif' and find their friends (subsequent pages)
    for filename in glob('*_000.tif'):
       # Work out stem of filename
       stem = filename.replace('_000.tif', '')
       print(f'DEBUG: stem={stem}')
    
       # Build list of images to be put in this PDF
       images = [Image.open(filename)]
       index = 1
       while True:
          this = f'{stem}_{index:03d}.tif'
          print(f'DEBUG: this={this}')
          if os.path.isfile(this):
             images.append(Image.open(this))
             index += 1
          else:
             break
       output = stem + '.pdf'
       print(f'DEBUG: Saving {len(images)} pages to {output}')
       images[0].save(output, save_all=True, append_images=images[1:])
    

    Sample Output

    DEBUG: stem=Drs_1_00192_1_ADS
    DEBUG: this=Drs_1_00192_1_ADS_001.tif
    DEBUG: this=Drs_1_00192_1_ADS_002.tif
    DEBUG: this=Drs_1_00192_1_ADS_003.tif
    DEBUG: this=Drs_1_00192_1_ADS_004.tif
    DEBUG: Saving 4 pages to Drs_1_00192_1_ADS.pdf
    DEBUG: stem=Drs_1_00099_1_ADS
    DEBUG: this=Drs_1_00099_1_ADS_001.tif
    DEBUG: this=Drs_1_00099_1_ADS_002.tif
    DEBUG: this=Drs_1_00099_1_ADS_003.tif
    DEBUG: Saving 3 pages to Drs_1_00099_1_ADS.pdf
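
The question also asks for files without a page suffix to become single-page PDFs. One possible way to cover both cases in a single pass is to group every .tif by its stem with a regular expression. This is only a sketch: the names `PAGE_RE`, `group_pages` and `save_pdfs` are made up for illustration, and the pattern assumes that a page suffix is always an underscore followed by exactly three digits, as in the filenames shown in the question.

```python
import re
from collections import defaultdict

# Assumed pattern: a stem, an optional three-digit page suffix, then ".tif"
PAGE_RE = re.compile(r'^(?P<stem>.+?)(?:_(?P<page>\d{3}))?\.tif$')


def group_pages(filenames):
    """Map each document stem to its page files, ordered by page number."""
    groups = defaultdict(list)
    for name in filenames:
        match = PAGE_RE.match(name)
        if match is None:
            continue  # name does not end in .tif
        # Files without a suffix are treated as page 0 of their own document
        page = int(match['page']) if match['page'] else 0
        groups[match['stem']].append((page, name))
    return {stem: [name for _, name in sorted(pages)]
            for stem, pages in groups.items()}


def save_pdfs(groups):
    """Write one PDF per stem, using the same Pillow call as above."""
    from PIL import Image  # imported here so grouping works without Pillow

    for stem, pages in groups.items():
        images = [Image.open(page) for page in pages]
        images[0].save(f'{stem}.pdf', save_all=True, append_images=images[1:])
```

With the filenames from the question, `group_pages(glob('*.tif'))` should give four groups, and `save_pdfs` would then write one PDF for each of them, including the single-page ones.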
    

    Note that you can just as easily use OpenCV for reading the file, by replacing:

    image = Image.open(filename)
    

    with

    image = cv2.imread(filename)
    

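    However, you can't write a PDF as simply with OpenCV as with PIL, so I just stuck with PIL. You can easily move between PIL and OpenCV if you remember that PIL uses RGB channel ordering whereas OpenCV uses BGR. For a 3-channel image (for a single-channel image the array is 2-D, so reversing its last axis would mirror the image horizontally, and no reordering is needed anyway), you can go from PIL to OpenCV with: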

    OpenCVImage = np.array(PILImage)[...,::-1]
    

    and back from OpenCV to PIL with:

    PILImage = Image.fromarray(OpenCVImage[...,::-1])
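
For the UPDATE in the question, where the goal is one text string per document rather than a PDF, the same idea could be applied to the OCR step. The sketch below assumes the pages have already been grouped into a dict that maps each stem to its ordered list of page files. That structure, the function name `ocr_documents` and its `ocr_page` parameter are all hypothetical. By default it calls `pytesseract.image_to_string` as in the question, and the OCR function can be swapped out, which makes the joining logic easy to try without Tesseract installed.

```python
def ocr_documents(groups, ocr_page=None, separator='\n'):
    """Return {stem: OCR text of all its pages, joined in page order}."""
    if ocr_page is None:
        # Default backend: pytesseract with Pillow, as used in the question
        import pytesseract
        from PIL import Image

        def ocr_page(path):
            return pytesseract.image_to_string(Image.open(path), lang='eng')

    return {stem: separator.join(ocr_page(page) for page in pages)
            for stem, pages in groups.items()}


# Hypothetical grouping, written out by hand from the question's filenames
example_groups = {
    'Drs_1_00109_1_ADS': ['Drs_1_00109_1_ADS.tif'],
    'Drs_1_00192_1_ADS': ['Drs_1_00192_1_ADS_000.tif',
                          'Drs_1_00192_1_ADS_001.tif'],
}
```

Calling `texts = ocr_documents(example_groups)` would then OCR each page, and `pd.DataFrame(list(texts.items()), columns=['document', 'raw_text'])` gives one row per document instead of one row per page.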