Creating variables with pytesseract

I have this code:

from PIL import Image
import pytesseract

print(pytesseract.image_to_string(Image.open('test.png')))

The text I get back (just the question and the answers) is:

Which team surrendered
the biggest lead in Super
Bowl history?

Atlanta Falcons

Denver Broncos

Buffalo Bills

Is there any way to say that lines 1, 2, and 3 are the question, then line 5 is answer 1, etc.?

Solution

This should work as long as the text always contains a '?' to split on, though it depends on how much your data differs between images.

image_text = pytesseract.image_to_string(Image.open('test.png'))
# maxsplit=1: split on the first '?' only, so there are at most two parts
text_list = image_text.split('?', 1)

This gives you a list with two elements: everything before the ? and everything after it. For example:

print(text_list)
['Which team surrendered\nthe biggest lead in Super\nBowl history',
'\n\nAtlanta Falcons\n\nDenver Broncos\n\nBuffalo Bills']

From here you can define q and a as the question and the answers.

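# split() drops the '?', so add it back to the question
q = text_list[0] + '?'
# strip each line and skip the blank ones
a = [line.strip() for line in text_list[1].split('\n') if line.strip()]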

This keeps the line breaks inside the question, so print(q) shows:

Which team surrendered
the biggest lead in Super
Bowl history?

The variable a then holds a list of the answers with the blank lines removed, so print(a) shows:

['Atlanta Falcons', 'Denver Broncos', 'Buffalo Bills']
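
If you want each answer under its own number, as the question asks ("answer 1", etc.), one option is to number the list. This is only a minimal sketch: the list is hard-coded so it runs without an image, and the name answers is made up for illustration.

```python
# Answers as produced by the list comprehension above (hard-coded here).
a = ['Atlanta Falcons', 'Denver Broncos', 'Buffalo Bills']

# Number the answers starting from 1 instead of 0.
answers = dict(enumerate(a, start=1))

print(answers[1])
print(answers[3])
```

A dict keyed by number is used here instead of separate variables so the code does not need to know in advance how many answers the image contains.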

Keep in mind that this fix depends on the text containing a ? to mark where the question ends and the answers begin.
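
If some images might not contain a ?, one possible fallback is to treat the first blank line as the end of the question. The sketch below assumes exactly that (a blank line between the question and the answers), which may not hold for every image. split_question is a hypothetical helper name, and the sample text is hard-coded in place of the pytesseract output.

```python
def split_question(text):
    """Return (question, answers) from a block of OCR text."""
    if '?' in text:
        # partition() splits on the first '?' only and returns it as sep
        head, sep, tail = text.partition('?')
        question = head.strip() + sep
    else:
        # Assumed fallback: the first blank line ends the question
        head, _, tail = text.strip().partition('\n\n')
        question = head.strip()
    # Keep non-empty lines only, without surrounding whitespace
    answers = [line.strip() for line in tail.splitlines() if line.strip()]
    return question, answers


# Hard-coded stand-in for pytesseract.image_to_string(...)
sample = ('Which team surrendered\nthe biggest lead in Super\nBowl history?'
          '\n\nAtlanta Falcons\n\nDenver Broncos\n\nBuffalo Bills')
q, a = split_question(sample)
print(q)
print(a)
```

With real OCR output you would pass the string returned by image_to_string instead of sample.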