I really need your help with Tesseract. I'm using Tesseract and pdf2image to extract information from a scanned PDF file. My problem is that Tesseract gets the accented characters é, è and ê wrong (I'm French) and confuses the lowercase "i" with the uppercase "I". I tried preprocessing the images first, but I can't get good output.
This is the code I'm using:
import os
from tkinter.filedialog import askopenfilename

import cv2
import numpy as np
import pytesseract
from pdf2image import convert_from_path

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
filePath = askopenfilename()
img = convert_from_path(filePath, poppler_path=r'C:\poppler-0.68.0_x86\poppler-0.68.0\bin')
path, fileName = os.path.split(filePath)
fileBaseName, fileExtension = os.path.splitext(fileName)
# Save each PDF page as a JPEG image
for page_number in range(len(img)):
    img[page_number].save(r'C:\Users\488096\Documents\page' + str(page_number) + '.jpg', 'JPEG')
work_img = None
txt = ''
# Tesseract
custom_config = r'--oem 3 --psm 6'
kernel = np.ones((1, 1), np.uint8)
for page_number in range(len(img)):
    img1 = cv2.imread(r'C:\Users\488096\Documents\page' + str(page_number) + '.jpg')
    # Preprocess the image to improve character recognition
    gray = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
    # Remove shadows
    cool_img = cv2.dilate(gray, kernel, iterations=1)
    norm_img = cv2.erode(cool_img, kernel, iterations=1)
    # Threshold using Otsu's method
    work_img = cv2.threshold(cv2.bilateralFilter(norm_img, 5, 75, 75), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    # Run OCR on the preprocessed page and append the text
    txt = txt + pytesseract.image_to_string(work_img, config=custom_config)
    print("Page # {} - {}".format(page_number, txt))
What can I do to get good results? Thanks a lot!
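You may need to install the French language pack (the "fra" trained data) and pass lang='fra' to pytesseract, since Tesseract defaults to English otherwise. More info here: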
https://pyimagesearch.com/2020/08/03/tesseract-ocr-for-non-english-languages/
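As a rough sketch of what that could look like with pytesseract: the function names `missing_languages` and `ocr_with_language` are made up for illustration, and the default `--oem 3 --psm 6` config is simply copied from the question. It assumes a pytesseract version recent enough to provide `get_languages`.

```python
def missing_languages(requested, installed):
    """Return the codes of a 'fra+eng'-style string that are not installed."""
    return [code for code in requested.split("+") if code not in installed]


def ocr_with_language(image, lang="fra", config="--oem 3 --psm 6"):
    """Run Tesseract on an image with an explicit language (hypothetical helper)."""
    # Imported here so the helper above can be used without Tesseract installed.
    import pytesseract

    # Ask Tesseract which language data files it can find.
    installed = pytesseract.get_languages(config="")
    missing = missing_languages(lang, installed)
    if missing:
        raise RuntimeError(f"Tesseract language data not installed: {missing}")

    # Without lang=..., Tesseract falls back to English, which has no é, è or ê.
    return pytesseract.image_to_string(image, lang=lang, config=config)
```

In the loop from the question, this would replace the `image_to_string` call, for example `txt += ocr_with_language(work_img)`. Passing `lang="fra+eng"` is also possible if the document mixes both languages.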
You can also use ocrmypdf, which I find the easiest way to turn PDFs into text: https://pypi.org/project/ocrmypdf/
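A minimal sketch of the ocrmypdf route through its Python API, assuming ocrmypdf and the French Tesseract data are installed. The helper names `sidecar_paths` and `pdf_to_text` and the `_ocr.pdf` naming are my own choices, not something ocrmypdf prescribes.

```python
from pathlib import Path


def sidecar_paths(pdf_path):
    """Derive the output PDF and sidecar text paths next to the input file."""
    pdf = Path(pdf_path)
    return pdf.with_name(pdf.stem + "_ocr.pdf"), pdf.with_suffix(".txt")


def pdf_to_text(pdf_path, languages=("fra",)):
    """OCR a scanned PDF and return the recognized text (hypothetical helper)."""
    # Imported here so the path helper works without ocrmypdf installed.
    import ocrmypdf

    out_pdf, out_txt = sidecar_paths(pdf_path)
    # sidecar= writes the plain text next to the searchable PDF;
    # deskew=True straightens tilted scans before recognition.
    ocrmypdf.ocr(pdf_path, out_pdf, language=list(languages),
                 sidecar=out_txt, deskew=True)
    return out_txt.read_text(encoding="utf-8")
```

On Windows, the call to `pdf_to_text` should probably sit under an `if __name__ == '__main__':` guard, because ocrmypdf uses multiprocessing.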