python-3.x image-processing ocr tesseract python-tesseract

How to get confidence of each line using pytesseract

I have successfully set up Tesseract and can convert images to text:

text = pytesseract.image_to_string(Image.open(image))

However, I need to get the confidence value for every line, and I cannot find a way to do this using pytesseract. Does anyone know how to do this?

I know this is possible using PyTessBaseAPI, but I cannot use it: I have spent hours trying to set it up with no luck. So I need a way to do this using pytesseract.

Solution

After much searching, I have figured out a way. Instead of image_to_string, use image_to_data. However, this gives you statistics for each word, not each line:

text = pytesseract.image_to_data(Image.open(file_image), output_type='data.frame')

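So I saved the result as a DataFrame (the 'data.frame' output type requires pandas to be installed) and used pandas to group by block_num, since the OCR groups the text into blocks. I also removed all rows with no confidence value (conf of -1). Note that a block can span several lines, so grouping by block_num alone matches lines only when each block holds a single line; the output also has par_num and line_num columns, and grouping by block_num, par_num and line_num together separates individual lines.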

text = text[text.conf != -1]
lines = text.groupby('block_num')['text'].apply(list)

Using the same logic, you can calculate the confidence per line by taking the mean confidence of all words within the same block:

conf = text.groupby(['block_num'])['conf'].mean()
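If a block contains more than one line, the same idea can be applied with a finer grouping key. Below is a minimal sketch that does this without pandas, assuming the data comes from image_to_data with output_type=pytesseract.Output.DICT, i.e. a dict of equal-length lists keyed by column name. The function name line_confidences is my own and not part of pytesseract, and skipping rows with empty text is an extra choice on my part.

```python
from collections import defaultdict


def line_confidences(data):
    """Return a list of (line_text, mean_confidence) tuples.

    `data` is assumed to be a dict of equal-length lists using the column
    names of pytesseract's image_to_data output (page_num, block_num,
    par_num, line_num, conf, text).
    """
    words = defaultdict(list)
    confs = defaultdict(list)
    rows = zip(
        data["page_num"], data["block_num"], data["par_num"],
        data["line_num"], data["conf"], data["text"],
    )
    for page, block, par, line, conf, text in rows:
        conf = float(conf)  # conf may arrive as int, float or str
        # Skip structural rows (conf == -1) and empty text cells.
        if conf < 0 or not str(text).strip():
            continue
        key = (page, block, par, line)  # identifies one physical line
        words[key].append(str(text))
        confs[key].append(conf)
    # Sorting the keys keeps the lines in page/block/paragraph/line order.
    return [
        (" ".join(words[key]), sum(confs[key]) / len(confs[key]))
        for key in sorted(words)
    ]


# Hypothetical usage (requires pytesseract and Pillow):
# data = pytesseract.image_to_data(Image.open(image),
#                                  output_type=pytesseract.Output.DICT)
# for line_text, mean_conf in line_confidences(data):
#     print(mean_conf, line_text)
```

With the DataFrame version above, the equivalent change should be to pass ['block_num', 'par_num', 'line_num'] to groupby instead of 'block_num'.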