I am trying my hand at Question Answering and have to make my own dataset. I have 5 columns:
question | context | answer | answer_start | answer_end
Each record in the context column has a chunk of text, e.g.,
Neil Alden Armstrong (August 5, 1930 – August 25, 2012) was an American astronaut and aeronautical engineer, and the first person to walk on the Moon. He was also a naval aviator, test pilot, and university professor.
The corresponding answer contains a string of text extracted from the context, e.g.,
the first person to walk on the Moon
I need to populate answer_start and answer_end, which are the starting/ending indexes of the answer text within context. In the above example, answer_start would be 114 and answer_end would be 150. They are currently empty columns.
I tried the following:
df['answer_start'].apply(lambda x: re.search(x['answer'], x['context']).start())
But it threw an error:
TypeError: 'int' object is not subscriptable
Is there a way to fix what I have? Is there a way to do this that doesn't require a loop?
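The error comes from calling apply on a single column: Series.apply passes each cell value to the lambda, not the row, so x['answer'] tries to index a scalar. Use DataFrame.apply with axis=1 to get the whole row. Since the answer is a literal substring, plain str.find also avoids treating it as a regex pattern. Try: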
df['answer_start'] = df.apply(lambda x: x['context'].find(x['answer']), axis=1)
df['answer_end'] = df['answer_start'] + df['answer'].str.len()
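>>> df[['answer_start', 'answer_end']]
   answer_start  answer_end
0           113         149
Note that str.find is 0-based and answer_end here is exclusive, which is why you get 113 and 149 rather than 114 and 150. If an answer is not found in its context, find returns -1.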
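To check the offsets, and to avoid silently storing -1 for answers that do not appear in their context, the lookup can be pulled into a small helper. This is only a sketch: find_span is a hypothetical name, not a pandas function, and returning (None, None) for a missing answer is my assumption about what you would want in that case.

```python
def find_span(context, answer):
    """Return (start, end) with context[start:end] == answer, or (None, None)."""
    start = context.find(answer)  # 0-based index, -1 when the answer is absent
    if start == -1:
        return (None, None)
    # end is exclusive, so it can be used directly as a slice bound
    return (start, start + len(answer))


context = (
    "Neil Alden Armstrong (August 5, 1930 – August 25, 2012) was an American "
    "astronaut and aeronautical engineer, and the first person to walk on the "
    "Moon. He was also a naval aviator, test pilot, and university professor."
)
answer = "the first person to walk on the Moon"

start, end = find_span(context, answer)
print(start, end)                          # 113 149
print(context[start:end] == answer)        # True
print(find_span(context, "Wapakoneta"))    # (None, None): not in the context
```

With a DataFrame, the same helper should plug into df.apply(lambda r: find_span(r['context'], r['answer']), axis=1, result_type='expand') to produce both columns at once. Keep in mind that apply with axis=1 still iterates over rows internally, so it removes the explicit loop rather than the per-row work.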