I have successfully set up Tesseract and can convert images to text...
text = pytesseract.image_to_string(Image.open(image))
However, I need the confidence value for every line, and I cannot find a way to get it using pytesseract. Does anyone know how to do this?
I know this is possible using PyTessBaseAPI, but I cannot use it: I've spent hours trying to set it up with no luck, so I need a way to do this using pytesseract.
After much searching, I have figured out a way. Instead of image_to_string, one should use image_to_data. However, this will give you statistics for each word, not each line...
text = pytesseract.image_to_data(Image.open(file_image), output_type='data.frame')
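If you want to look at the word-level output without pandas, here is a minimal sketch. It assumes you call image_to_data with output_type=pytesseract.Output.DICT, which returns a dict of parallel lists (one entry per row). The helper names word_confidences and ocr_word_confidences are my own, not part of pytesseract.

```python
def word_confidences(data):
    """Pair each recognized word with its confidence.

    `data` is assumed to be the dict returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # conf may arrive as int, float or str
        # conf == -1 marks structural rows (page, block, paragraph, line),
        # which carry no recognized word
        if conf != -1 and str(text).strip():
            words.append((text, conf))
    return words


def ocr_word_confidences(image_path):
    # Imported lazily so word_confidences() can be tried without Tesseract.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return word_confidences(data)
```

Each tuple is (word, confidence), so you can check which words Tesseract was unsure about before aggregating anything.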
So I saved the result as a dataframe and then used pandas to group by block_num, since the OCR output groups each line into a block. I also removed all rows with no confidence value (-1)...
text = text[text.conf != -1]
lines = text.groupby('block_num')['text'].apply(list)
Using the same logic, you can also calculate the confidence per line as the mean confidence of all words within the same block...
conf = text.groupby(['block_num'])['conf'].mean()
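One caveat: a block can contain more than one line of text, in which case grouping by block_num alone merges those lines. The data frame returned by image_to_data also has par_num and line_num columns, so a sketch that assumes you want strictly per-line results can group on all three. The function name line_confidences is hypothetical, and joining words with a single space is my own choice.

```python
import pandas as pd


def line_confidences(df):
    """Return one row per OCR line with its text and mean word confidence.

    `df` is assumed to be the frame returned by
    pytesseract.image_to_data(..., output_type='data.frame').
    """
    words = df[df["conf"] != -1]           # drop structural rows (no confidence)
    words = words.dropna(subset=["text"])  # drop rows where no text was recognized
    keys = ["block_num", "par_num", "line_num"]
    return (
        words.groupby(keys)
        .agg(
            text=("text", lambda s: " ".join(s.astype(str))),
            conf=("conf", "mean"),
        )
        .reset_index()
    )
```

Calling line_confidences(text) on the data frame from above gives one row per line, with the joined words in the text column and the mean word confidence in the conf column.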