UPDATE: I found out that it is not reasonable to create PDF files from the OCRed files, so it would be better to leave them as they are, without conversion. I still have the problem that some images belong together as one document while others are single pages.
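import glob

import pandas as pd
import pytesseract
from PIL import Image

data = []
listOfPages = glob.glob(r"C:/Users/name/test/*.tif")
for entry in listOfPages:
    # Tesseract language codes have three letters: "eng", not "en"
    text = pytesseract.image_to_string(Image.open(entry), lang="eng")
    data.append(text)
df0 = pd.DataFrame(data, columns=['raw_text'])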
This creates a pandas df where each row holds the OCR text of one single-page .tif file. How can I concatenate the .tif files (see original question) in order to get the full multipage string?
Original question:
I want to convert the single-page .tif files in my_folder to multipage .pdf files in pdf_folder. TIFFs not having subsequent pages should also be converted to single-page PDFs. Ultimately, I want a text PDF created by OCR-ing multiple image-based TIFF files.
Therefore I infer the groups of .tif files that should go together from the filename pattern:
Drs_1_00109_1_ADS.tif
Drs_1_00099_1_ADS_000.tif
Drs_1_00099_1_ADS_001.tif
Drs_1_00099_1_ADS_002.tif
Drs_1_00186_1_ADS.tif
Drs_1_00192_1_ADS_000.tif
Drs_1_00192_1_ADS_001.tif
For example, from Drs_1_00192_1_ADS_000.tif and Drs_1_00192_1_ADS_001.tif (which are two single-page pictures) I want to create the 2-page Drs_1_00192_1_ADS.pdf containing the text data of both pictures.
The code works for single-page PDF creation. How can I make this work for the multipage pattern from the filenames?
Thanks!
I would do that by globbing for all files ending in 000.tif, which presumably are the starting points for multi-page documents, then appending files that result from incrementing the suffix until a file is missing.
#!/usr/bin/env python3

import os
from PIL import Image
from glob import glob

# Iterate over all files ending in '000.tif' and find their friends (subsequent pages)
for filename in glob('*_000.tif'):

    # Work out stem of filename
    stem = filename.replace('_000.tif', '')
    print(f'DEBUG: stem={stem}')

    # Build list of images to be put in this PDF
    images = [Image.open(filename)]
    index = 1
    while True:
        this = f'{stem}_{index:03d}.tif'
        print(f'DEBUG: this={this}')
        if os.path.isfile(this):
            images.append(Image.open(this))
            index += 1
        else:
            break

    output = stem + '.pdf'
    print(f'DEBUG: Saving {len(images)} pages to {output}')
    images[0].save(output, save_all=True, append_images=images[1:])
Sample Output
DEBUG: stem=Drs_1_00192_1_ADS
DEBUG: this=Drs_1_00192_1_ADS_001.tif
DEBUG: this=Drs_1_00192_1_ADS_002.tif
DEBUG: this=Drs_1_00192_1_ADS_003.tif
DEBUG: this=Drs_1_00192_1_ADS_004.tif
DEBUG: Saving 4 pages to Drs_1_00192_1_ADS.pdf
DEBUG: stem=Drs_1_00099_1_ADS
DEBUG: this=Drs_1_00099_1_ADS_001.tif
DEBUG: this=Drs_1_00099_1_ADS_002.tif
DEBUG: this=Drs_1_00099_1_ADS_003.tif
DEBUG: Saving 3 pages to Drs_1_00099_1_ADS.pdf
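The script above only picks up documents whose first page ends in _000.tif, so files without a page suffix (such as Drs_1_00109_1_ADS.tif) are skipped, and it writes image PDFs rather than text. For the updated question, a minimal sketch that groups the files by stem and joins the OCR text of each group might look like the following. The names PAGE_RE, group_pages and ocr_documents are hypothetical helpers, not part of any library, and the pattern assumes that a three-digit suffix before .tif always denotes a page number.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: an optional "_NNN" (three digits) right before ".tif" is the page number.
PAGE_RE = re.compile(r"^(?P<stem>.+?)(?:_(?P<page>\d{3}))?\.tif$", re.IGNORECASE)


def group_pages(filenames):
    """Map each document stem to its page files, sorted by page number."""
    groups = defaultdict(list)
    for name in filenames:
        match = PAGE_RE.match(Path(name).name)
        if match is None:
            continue  # not a .tif file, ignore it
        # Files without a suffix are treated as page 0 of a one-page document.
        page = int(match.group("page") or 0)
        groups[match.group("stem")].append((page, name))
    return {
        stem: [name for _, name in sorted(pages)]
        for stem, pages in sorted(groups.items())
    }


def ocr_documents(filenames, ocr):
    """Return {stem: text of all pages joined by newlines}.

    `ocr` is any callable taking a file path and returning a string.
    """
    return {
        stem: "\n".join(ocr(page) for page in pages)
        for stem, pages in group_pages(filenames).items()
    }


# Possible usage with pytesseract (not executed here; assumes Tesseract,
# pytesseract, Pillow and pandas are installed):
#
#   texts = ocr_documents(
#       glob.glob(r"C:/Users/name/test/*.tif"),
#       ocr=lambda path: pytesseract.image_to_string(Image.open(path), lang="eng"),
#   )
#   df0 = pd.DataFrame(
#       {"document": list(texts), "raw_text": list(texts.values())}
#   )
```

Passing the OCR step in as a callable keeps the grouping logic separate from Tesseract, so the grouping can be checked on plain filenames before any image is read.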
Note that you can just as easily use OpenCV for reading the file, by replacing:
image = Image.open(filename)
with
image = cv2.imread(filename)
However, writing a PDF is not as simple with OpenCV as it is with PIL, so I stuck with PIL. You can easily move between PIL and OpenCV if you remember that PIL uses RGB channel ordering whereas OpenCV uses BGR. For three-channel RGB images, you can go from PIL to OpenCV with:
OpenCVImage = np.array(PILImage)[...,::-1]
and
PILImage = Image.fromarray(OpenCVImage[...,::-1])