
How to get confidence of each line using pytesseract


I have successfully set up Tesseract and can convert images to text...

text = pytesseract.image_to_string(Image.open(image))

However, I need the confidence value for every line, and I cannot find a way to get it using pytesseract. Does anyone know how to do this?

I know this is possible with PyTessBaseAPI, but I cannot use it: I have spent hours trying to set it up with no luck. I need a way to do this using pytesseract.


Solution

  • After much searching, I figured out a way. Instead of image_to_string, use image_to_data. However, this gives you statistics for each word, not each line...

    text = pytesseract.image_to_data(Image.open(file_image), output_type='data.frame')
    

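    So I loaded the output as a DataFrame and used pandas to group the words by block_num, since the OCR assigns every word to a block. I also removed all rows with no confidence value (-1). Note that a block can hold more than one line; the par_num and line_num columns identify the individual lines within a block...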

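    # Drop rows that carry no confidence value (conf == -1)
    text = text[text.conf != -1]
    # Collect the recognized words of each block into a list
    lines = text.groupby('block_num')['text'].apply(list)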
    

    Using the same logic, you can also calculate the confidence per line as the mean confidence of all words within the same block...

    conf = text.groupby(['block_num'])['conf'].mean()
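
If a block in your image contains several lines, grouping by block_num alone will merge them. Here is a minimal sketch of a stricter variant that groups by page_num, block_num, par_num and line_num together. This is not what I did above, just one way to extend it. The helper name line_confidences and the LINE_KEYS constant are made up for illustration, and the function assumes a DataFrame with the columns that image_to_data returns:

```python
import pandas as pd

# Columns that together identify a single text line in image_to_data output.
# (LINE_KEYS and line_confidences are illustrative names, not part of pytesseract.)
LINE_KEYS = ["page_num", "block_num", "par_num", "line_num"]


def line_confidences(data: pd.DataFrame) -> pd.DataFrame:
    """Return one row per OCR line with its joined text and mean word confidence."""
    # Keep only rows that have a confidence value and some text.
    words = data[data["conf"] != -1].dropna(subset=["text"]).copy()
    # Numeric-looking words may be parsed as numbers, so force strings.
    words["text"] = words["text"].astype(str)
    words = words[words["text"].str.strip() != ""]

    # Join the words of each line and average their confidences.
    return (
        words.groupby(LINE_KEYS, sort=True)
        .agg(
            text=("text", lambda s: " ".join(s)),
            conf=("conf", "mean"),
        )
        .reset_index()
    )
```

With pytesseract installed, it could be called as line_confidences(pytesseract.image_to_data(Image.open(file_image), output_type='data.frame')). The result has one row per line, with the line's text in the text column and its mean word confidence in the conf column.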