
Strange symbol in tesseract output


I'd like to know why this symbol appears in the output and how I can remove it.

All images I use have the same behavior.

I can't get rid of it.

I need the value extracted from the image without that symbol because I'll use it later in another place.

script.py

import pytesseract as ocr
from PIL import Image

custom_config = r'--psm 3'
phrase = ocr.image_to_string(Image.open('image.jpg'), config=custom_config)
print(phrase)

Using pytesseract: [screenshot of the output]

Using tesseract: [screenshot of the output]

image.jpg: [input image]


Solution

  • That symbol is a form feed character (FF, U+000C, written \f in Python), which Tesseract uses to delimit pages of OCRed text. Trim it from the output string, for example with Python's str.strip(), before printing it or passing the value on.
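
A minimal sketch of that trimming step, assuming the OCR result is an ordinary Python string. The helper names clean_ocr_text and split_pages are hypothetical and not part of pytesseract, and the sample string only simulates what image_to_string might return.

```python
FORM_FEED = "\f"  # U+000C, the character Tesseract uses to separate pages


def clean_ocr_text(text: str) -> str:
    """Strip leading/trailing whitespace; str.strip() also removes form feeds."""
    return text.strip()


def split_pages(text: str) -> list[str]:
    """Split multi-page OCR output on form feeds and drop empty pages."""
    return [page.strip() for page in text.split(FORM_FEED) if page.strip()]


# Simulated OCR output (assumption). In script.py this would be the value of
# ocr.image_to_string(Image.open('image.jpg'), config=custom_config).
raw = "1234\n\f"

print(repr(clean_ocr_text(raw)))
print(split_pages("page one\n\fpage two\n\f"))
```

In the script from the question, this would amount to calling print(phrase.strip()) instead of print(phrase). Tesseract also exposes a page_separator config variable, so passing something like -c page_separator="" in custom_config may suppress the character at the source; whether that works depends on the Tesseract version, so treat it as something to verify rather than a guarantee.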