
Creating variables with pytesseract


In my code:

from PIL import Image
import pytesseract

print(pytesseract.image_to_string(Image.open('test.png')))

The output I get (showing only the question and answers) is:

Which team surrendered
the biggest lead in Super
Bowl history?

Atlanta Falcons

Denver Broncos

Buffalo Bills

Is there any way to say that lines 1, 2, and 3 are the question, then line 5 is answer 1, etc.?


Solution

  • Depending on how much your data differs between images, this should work, as long as there is always a '?' to split on.

    image_text=pytesseract.image_to_string(Image.open('test.png'))
    text_list=image_text.split('?')
    

    This gives you a list with two elements: everything before the ? and everything after it. For example:

    print(text_list)
    ['Which team surrendered\nthe biggest lead in Super\nBowl history',
    '\n\nAtlanta Falcons\n\nDenver Broncos\n\nBuffalo Bills']
    

    From here you can define q and a as the question and the answers.

    q = text_list[0] + '?'  # split() drops the '?', so add it back
    a = [line for line in text_list[1].split('\n') if line]  # skip blank lines
    

    The logic above keeps the newlines in the question, so it stays formatted as:

    Which team surrendered
    the biggest lead in Super
    Bowl history?
    

    The variable a then holds a list of the answers with the blank lines removed, so print(a) prints:

    ['Atlanta Falcons', 'Denver Broncos', 'Buffalo Bills']
    

    Keep in mind that this fix depends on the text containing a ? to mark where the question ends and the answers begin.
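
If some images might not contain a ?, the same idea can be wrapped in a small helper with a fallback. The sketch below is only one possible approach: split_question_answers is a made-up name, and the fallback assumes the question and each answer are separated by blank lines, as in the sample output above. It uses str.partition so that only the first ? is treated as the end of the question.

```python
def split_question_answers(text):
    """Split OCR text into (question, answers) at the first '?'."""
    question, mark, rest = text.partition('?')
    if not mark:
        # Assumption: with no '?', the first blank-line-separated
        # block is the question and the remaining blocks are answers.
        blocks = [b for b in text.split('\n\n') if b.strip()]
        question = blocks[0] if blocks else ''
        rest = '\n'.join(blocks[1:])
    # Re-attach the '?' (mark is empty if none was found).
    question = question.strip() + mark
    # Keep only non-blank lines as answers.
    answers = [line.strip() for line in rest.split('\n') if line.strip()]
    return question, answers


# Sample text copied from the OCR output shown in the question.
sample = ('Which team surrendered\nthe biggest lead in Super\nBowl history?'
          '\n\nAtlanta Falcons\n\nDenver Broncos\n\nBuffalo Bills')
question, answers = split_question_answers(sample)
```

With real input you would pass pytesseract.image_to_string(Image.open('test.png')) instead of sample; answers[0] is then answer 1, answers[1] is answer 2, and so on.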