I am running Tesseract OCR on the picture below:
My code to process the picture is:
# HOCR
with image[450:6200, 840:3550] as cropped:
    imgPage = wi(image=cropped)
    imageBlob = imgPage.make_blob('png')
    horas = gerarHocr(imageBlob)
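import io
import pytesseract
from bs4 import BeautifulSoup
from PIL import Image

def gerarHocr(imageBlob):
    # Decode the PNG blob and run Tesseract in hOCR mode (Portuguese, single text block)
    image = Image.open(io.BytesIO(imageBlob))
    markup = pytesseract.image_to_pdf_or_hocr(image, lang='por', extension='hocr', config='--psm 6')
    soup = BeautifulSoup(markup, features='html.parser')
    spans = soup.find_all('span', {'class': 'ocrx_word'})
    listHoras = []
    ...
    return listHoras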
However, the OCR sometimes gets confused between 8 and 3 and emits both, returning 07:44/14:183 instead of 07:44/14:13, for example.
I think removing the grey lines with Wand would improve the OCR's confidence. How can I do that, please?
Thank you,
If the system is using ImageMagick-6, you can call Image.threshold(), but you might need to remove the transparency first.
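from wand.image import Image

with Image(filename='PWILE.png') as img:
    # Remove transparency so the threshold sees a white background
    img.background_color = 'WHITE'
    img.alpha_channel = False
    # Pixels brighter than 50% become white, the rest black
    img.threshold(threshold=0.5)
    img.save(filename='output_threshold.png')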
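To see why a fixed threshold can drop light grey lines while keeping dark text, here is a minimal pure-Python sketch of the idea. It is not Wand's implementation: the helper name `threshold_pixels` and the sample pixel values are made up for illustration, with intensities normalised to the 0.0 to 1.0 range to match the `threshold=0.5` argument above.

```python
def threshold_pixels(row, threshold=0.5):
    """Map each normalised intensity to pure black (0.0) or pure white (1.0)."""
    # Values strictly above the threshold become white; the rest become black.
    return [1.0 if value > threshold else 0.0 for value in row]


# Hypothetical scanline: white paper, a light grey rule, then dark text strokes.
scanline = [1.0, 0.8, 0.8, 0.15, 0.1, 0.8, 1.0]
binary = threshold_pixels(scanline)
print(binary)
```

In this made-up scanline the grey values (0.8) land on the white side and only the dark strokes stay black. If the lines in the real image are darker than mid-grey, the threshold would probably need to be lowered so they still turn white, at the risk of thinning the text.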
If you're using ImageMagick-7 (anything above version 7.0.8-41), then Image.auto_threshold() will work.
from wand.image import Image

with Image(filename='support/PWILE.png') as img:
    # Let ImageMagick pick the cut-off from the image histogram
    img.auto_threshold(method='otsu')
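Otsu's method picks the cut-off from the image's histogram instead of using a fixed value. The following is a rough pure-Python sketch of that idea on a made-up 8-level histogram. The function name `otsu_threshold` and the sample counts are hypothetical, and this is not ImageMagick's code.

```python
def otsu_threshold(histogram):
    """Return the level t that best separates levels <= t from levels > t."""
    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))

    best_level, best_variance = 0, -1.0
    count_below = 0      # number of pixels at or below the candidate level
    weighted_below = 0   # sum of the levels of those pixels

    for level, count in enumerate(histogram):
        count_below += count
        weighted_below += level * count
        count_above = total - count_below
        if count_below == 0 or count_above == 0:
            continue  # one class is empty, so there is nothing to separate

        mean_below = weighted_below / count_below
        mean_above = (weighted_total - weighted_below) / count_above
        # Scaled between-class variance; Otsu keeps the level that maximises it.
        variance = count_below * count_above * (mean_below - mean_above) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance

    return best_level


# Hypothetical histogram: dark text at levels 0-1, grey lines and paper at 5-7.
histogram = [30, 20, 0, 0, 0, 10, 40, 100]
level = otsu_threshold(histogram)
print(level)
```

With these sample counts the chosen level sits just above the dark text, so the grey lines at level 5 fall on the white side along with the paper. Whether that also happens on the real scan depends on how far the line grey is from the text grey, so it is worth inspecting the thresholded image before passing it to Tesseract.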