Tags: python, list, dataframe, apply, spacy

Python dataframe: delete sentences by number from a list


I have a column of (quite) long texts in a dataframe and, for each text, a list of sentence indexes that I would like to delete. The sentence indexes were generated by spaCy when I split the texts into sentences. Please consider the following example:

import pandas as pd
import spacy
nlp = spacy.load('en_core_web_sm')

data = {
    'text': ['I am A. I am 30 years old. I live in NY.',
             'I am B. I am 25 years old. I live in SD.',
             'I am C. I am 30 years old. I live in TX.'],
    'todel': [[1, 2], [1], [1, 2]]
}

df = pd.DataFrame(data)

def get_sentences(text):
    # Split the text into sentences with spaCy and return them as strings
    return [str(sentence) for sentence in nlp(text).sents]

df['text'] = df['text'].apply(get_sentences)

print(df)

which gives the following:

                                           text   todel
0  [I am A., I am 30 years old., I live in NY.]  [1, 2]
1   [I am B. I am 25 years old., I live in SD.]     [1]
2   [I am C. I am 30 years old., I live in TX.]  [1, 2]

How would you efficiently delete the sentences whose indexes are stored in todel, given that I have a very large dataset with more than 50 sentences to drop per row?

My expected output would be:

                                  text   todel
0                      [I live in NY.]  [1, 2]
1  [I am 25 years old., I live in SD.]     [1]
2                      [I live in TX.]  [1, 2]

Solution

  • Based on @user1740577's answer:

    def fun(sen, lst):
        # Keep only the sentences whose position is not listed in lst
        return [i for j, i in enumerate(sen) if j not in lst]
    
    df['text'] = df.apply(lambda row: fun(row['text'], row['todel']), axis=1)
    

    This yields the desired result with zero-based indexes (as produced by enumerate and spaCy). Note that this differs from the expected output above, which assumed one-based sentence numbering:

                               text    todel
    0                     [I am A.]   [1, 2]
    1  [I am B. I am 25 years old.]      [1]
    2  [I am C. I am 30 years old.]   [1, 2]
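When each todel list holds dozens of indexes, the `j not in lst` membership test scans the whole list on every sentence. Converting the list to a set once per row makes each lookup O(1). A minimal sketch of that variant (the helper name `drop_sentences` is my own, and spaCy is not needed here since the text column already holds sentence lists):

```python
import pandas as pd

def drop_sentences(sentences, to_drop):
    # Convert once to a set so each membership check is O(1)
    drop = set(to_drop)
    return [s for j, s in enumerate(sentences) if j not in drop]

df = pd.DataFrame({
    'text': [['I am A.', 'I am 30 years old.', 'I live in NY.']],
    'todel': [[1, 2]],
})
df['text'] = df.apply(lambda row: drop_sentences(row['text'], row['todel']), axis=1)
print(df['text'].iloc[0])  # ['I am A.']
```

For 50+ indexes per row this turns the per-row cost from O(sentences × indexes) into O(sentences + indexes), which should matter on a very large dataset.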