In my code
from PIL import Image
import pytesseract
print(pytesseract.image_to_string(Image.open('test.png')))
The results I get from here (just from the question and answers) are:
Which team surrendered
the biggest lead in Super
Bowl history?
Atlanta Falcons
Denver Broncos
Buffalo Bills
Is there any way to say that lines 1, 2, and 3 are the question, then line 5 is answer 1, etc.?
Depending on how much your data differs between images, this should work, provided you always have a '?' to split on.
image_text = pytesseract.image_to_string(Image.open('test.png'))
text_list = image_text.split('?', 1)  # split only at the first '?'
This will give you a list with two elements: the first is everything before the ? and the second is everything after it. For example:
print(text_list)
['Which team surrendered\nthe biggest lead in Super\nBowl history',
'\n\nAtlanta Falcons\n\nDenver Broncos\n\nBuffalo Bills']
From here you can define q and a as the question and the answers.
q = text_list[0] + '?'  # split() drops the '?', so add it back
a = [line.strip() for line in text_list[1].split('\n') if line.strip()]  # skip blank lines
The logic above keeps the newlines in the question, leaving it formatted as:
Which team surrendered
the biggest lead in Super
Bowl history?
The variable a will then hold a list of the answers with no blank entries, so print(a) would return:
['Atlanta Falcons', 'Denver Broncos', 'Buffalo Bills']
Keep in mind that this fix depends on the text containing a ?, which determines which part of the string is the question and which part is the answers.
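If some images might not contain a ?, the same idea can be wrapped in a small helper that reports that case instead of failing. This is only a sketch: the function name split_question_answers is made up for illustration, it uses str.partition rather than split, and it joins the question onto one line, which may or may not be what you want.

```python
def split_question_answers(ocr_text):
    """Split OCR text into (question, answers) at the first '?'.

    Hypothetical helper, not part of pytesseract. Returns (None, lines)
    when no '?' is found, so the caller can decide what to do.
    """
    head, sep, tail = ocr_text.partition('?')
    if not sep:
        # No '?' in the text: return every non-blank line and no question.
        lines = [ln.strip() for ln in ocr_text.splitlines() if ln.strip()]
        return None, lines
    # Collapse the question's line breaks into single spaces.
    question = ' '.join(head.split()) + '?'
    # Every non-blank line after the '?' is treated as one answer.
    answers = [ln.strip() for ln in tail.splitlines() if ln.strip()]
    return question, answers


# Sample text copied from the OCR output shown above.
sample = ('Which team surrendered\nthe biggest lead in Super\nBowl history?'
          '\n\nAtlanta Falcons\n\nDenver Broncos\n\nBuffalo Bills')
question, answers = split_question_answers(sample)
print(question)
print(answers)
```

In real use you would pass in the string returned by pytesseract.image_to_string instead of the sample, and check whether the question came back as None before relying on the answers.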