
Python: Find starting, ending index of sub-text column from another text column


I am trying my hand at Question Answering and have to make my own dataset. I have 5 columns:

question | context | answer | answer_start | answer_end

Each record in the context column has a chunk of text, e.g.,

Neil Alden Armstrong (August 5, 1930 – August 25, 2012) was an American astronaut and aeronautical engineer, and the first person to walk on the Moon. He was also a naval aviator, test pilot, and university professor.

The corresponding answer contains a string of text extracted from the context, e.g.,

the first person to walk on the Moon

I need to populate answer_start and answer_end, which are the starting and ending indexes of the answer text within context. In the above example, answer_start would be 114 and answer_end would be 150. Both columns are currently empty.

I tried the following:

df['answer_start'].apply(lambda x: re.search(x['answer'], x['context']).start())

But it threw an error:

TypeError: 'int' object is not subscriptable

Is there a way to fix what I have? Is there a way to do this that doesn't require a loop?


Solution

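  • df['answer_start'].apply(...) calls the lambda on each value of that one column, so x is a scalar rather than a row, and x['answer'] raises the TypeError. To read several columns per row, call apply on the DataFrame itself with axis=1: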

    df['answer_start'] = df.apply(lambda x: x['context'].find(x['answer']), axis=1)
    df['answer_end'] = df['answer_start'] + df['answer'].str.len()
    
    >>> df[['answer_start', 'answer_end']]
       answer_start  answer_end
    0           113         149
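
For anyone who wants to check the offsets end to end, here is a minimal, self-contained sketch. It rebuilds the example row from the question, adds a second made-up row whose answer does not occur in its context, and uses a list comprehension over the two columns instead of apply. The second row and the answer_found column are assumptions added for illustration; they are not part of the original dataset.

```python
import pandas as pd

# Row 0 is the example from the question; row 1 is a hypothetical row whose
# answer is absent from its context, to show what str.find does in that case.
df = pd.DataFrame({
    "context": [
        "Neil Alden Armstrong (August 5, 1930 – August 25, 2012) was an "
        "American astronaut and aeronautical engineer, and the first person "
        "to walk on the Moon. He was also a naval aviator, test pilot, and "
        "university professor.",
        "He was also a naval aviator.",
    ],
    "answer": ["the first person to walk on the Moon", "test pilot"],
})

# str.find returns the 0-based start offset, or -1 when the substring is absent.
df["answer_start"] = [c.find(a) for c, a in zip(df["context"], df["answer"])]
# The end offset is exclusive: start plus the length of the answer.
df["answer_end"] = df["answer_start"] + df["answer"].str.len()

# Hypothetical helper column: flags rows where the answer was actually found,
# so that a -1 start is never mistaken for a real offset.
df["answer_found"] = df["answer_start"] >= 0

# Sanity check: slicing the context with the offsets gives back the answer.
start, end = df.loc[0, "answer_start"], df.loc[0, "answer_end"]
recovered = df.loc[0, "context"][start:end]
print(recovered == df.loc[0, "answer"])
```

The offsets are 0-based with an exclusive end, which is why the result is 113 and 149 rather than the 114 and 150 mentioned in the question. If you prefer to keep re.search, note that it treats the answer as a regular expression, so answers containing characters such as parentheses or dots would need re.escape first; str.find matches the text literally.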