python ocr python-tesseract post-processing

Removing newline \n from tesseract return values

I have a bunch of images, each corresponding to a name, that I'm passing to Pytesseract for recognition. Some of the names are long and are written across multiple lines in the image, so when I run recognition and save the results to a .txt file, each part ends up on its own line.

Here's an example

This is being recognized as

MARTHE
MVUMBI

While I need them to be on the same line.

Another Example

It should be MOHAMED ASSAD YVES but it's actually being stored as:

MOHAMED

ASSAD YVES

I thought my filtering already handled this sort of thing, but apparently it doesn't. Here's the code I'm using for recognition, storing and filtering.

import glob
import os
import re

import pytesseract

# Adding custom options
folder = r"C:\Users\lenovo\PycharmProjects\SoftOCR_PFE\name_results"
custom_config = r'--oem 3 --psm 6'
words = []
filenames = os.listdir(folder)
filenames.sort()
for directory in filenames:
    print(directory)
    for img in glob.glob(rf"name_results\{directory}\*.png"):
        text = pytesseract.image_to_string(img, config=custom_config)
        words.append(text)
    words.append("\n")
# Strip first, otherwise the trailing newline keeps 'NOM' and 'PRENOM' from ever matching
stripped = [s.strip() for s in words]
all_caps = [s for s in stripped if s == s.upper() and s not in ('NOM', 'PRENOM')]

no_blank = [string for string in all_caps if string != ""]

with open('temp.txt', 'w+') as filehandle:
    for listitem in no_blank:
        filehandle.write(f'{listitem}\n')
with open("temp.txt") as filehandle:
    uncleanText = filehandle.read()
# Keep only letters, digits and whitespace
cleanText = re.sub(r'[^A-Za-z0-9\s]+', '', uncleanText)
with open('saved_names.txt', 'w') as filehandle:
    filehandle.write(cleanText)

I had to post again since my last question was posted really late at night and didn't get any action.

Solution

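pytesseract.image_to_string returns everything it recognized in the image as a single string, so a name written on two lines comes back with a \n between the parts, and strip() only removes whitespace at the ends of the string. To fix this, I would add one line after the line: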

text = pytesseract.image_to_string(img, config=custom_config)

This line:

text = text.replace("\n", " ")
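A plain replace leaves a double space wherever the recognized text contains a blank line, as in the MOHAMED / ASSAD YVES example. A minimal sketch of a slightly more thorough cleanup, assuming you want every run of whitespace collapsed to a single space; `flatten_ocr_text` is a hypothetical helper name, not part of pytesseract:

```python
def flatten_ocr_text(text: str) -> str:
    """Collapse every run of whitespace (newlines, form feeds, spaces) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty
    # pieces, so blank lines and trailing newlines disappear as well.
    return " ".join(text.split())


print(flatten_ocr_text("MARTHE\nMVUMBI\n"))
print(flatten_ocr_text("MOHAMED\n\nASSAD YVES\n"))
```

It could be called in the loop in place of the replace, e.g. `text = flatten_ocr_text(text)`.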

Update

There was another problem: how to join every second line in the file with a comma and save the result back to the file. It can be done this way:

with open("temp.txt", "r") as f:
    names = f.readlines()

names = [n.replace("\n", "") for n in names]
names = [", ".join(names[i:i+2]) for i in range(0, len(names), 2)]

with open("temp.txt", "w") as f:
    f.write("\n".join(names))
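The slicing step can be tried on an in-memory list before touching the file. A small sketch, where `pair_lines` is a hypothetical helper and the single-letter entries are placeholders rather than real OCR output; note that with an odd number of lines the last one is left on its own:

```python
def pair_lines(lines, sep=", "):
    """Join lines two at a time: [a, b, c, d] -> ["a, b", "c, d"]."""
    # Drop the trailing newline that readlines() keeps on each entry.
    cleaned = [line.rstrip("\n") for line in lines]
    # Step through the list two items at a time and join each slice.
    return [sep.join(cleaned[i:i + 2]) for i in range(0, len(cleaned), 2)]


print(pair_lines(["A\n", "B\n", "C\n", "D\n"]))
print(pair_lines(["A\n", "B\n", "C\n"]))
```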