python, pandas, nlp, nltk, spacy

pandas: create rows of sentences (with identifier) from text


I have a pandas dataframe that looks like this:

textID1, text1, othermetadata1
textID2, text2, othermetadata2
textID3, text3, othermetadata3

I would like to break the texts into sentences in a new data frame that would look like this:

textID1-001, sentence1 (of text1), othermetadata1
textID1-002, sentence2 (of text1), othermetadata1
textID2-001, sentence1 (of text2), othermetadata2

I know how to break texts into sentences using either NLTK or spaCy, e.g.:

sentences = [ sent_tokenize(text) for text in texts ]

But pandas continues to confound me: how do I take the output and pack it back into a data frame? Moreover, how do I add numbers, either to an existing column or in a new column, that restart with each text? My assumption is that I could then merge the textID and sentenceID columns.


Solution

  • Once you have your list of sentences for a given row, you can break them into separate rows using explode(). Exploding keeps the index of the original DataFrame, and cumcount() then gives you a consecutive sentence number within each original row.

    Here I am assuming your text ID column is called "text_id" and your sentences column is called "sentences".

    df = df.explode('sentences').reset_index().rename(columns={'index': 'row_id'})
    df['row_id'] = df.groupby('row_id').cumcount()
    

    If you want to combine your text ID column with the row_id, you can just use the following:

    df['text_id'] = df['text_id'] + "-" + df["row_id"].astype(str)
    
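    If you want the zero-padded, one-based numbering shown in the question (e.g. textID1-001), pad the counter before concatenating. A small variation on the line above:

    df['text_id'] = df['text_id'] + "-" + (df['row_id'] + 1).astype(str).str.zfill(3)
    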

    Complete solution for the above:

    import pandas as pd
    import nltk
    nltk.download('punkt')
    from nltk.tokenize import sent_tokenize
    
    df = pd.DataFrame([["textID1", "text1 sentence1. text1 sentence2", "othermetadata1"],
                       ["textID2", "text2 sentence1", "othermetadata2"],
                       ["textID3", "text3 sentence1. text3 sentence2. text3 sentence3", "othermetadata3"]],
                      columns=["text_id", "text", "metadata"])
    
    # split each text into a list of sentences
    df['text'] = df['text'].apply(sent_tokenize)
    # one row per sentence; keep the original row index as row_id
    df = df.explode("text").reset_index().rename(columns={'index': 'row_id'})
    # restart the sentence counter for each original text
    df['row_id'] = df.groupby('row_id').cumcount()
    # combine the text ID with the per-text sentence number
    df['text_id'] = df['text_id'] + "-" + df["row_id"].astype(str)
    
    df = df.drop(columns=['row_id'])
    df
    
         text_id              text        metadata
    0  textID1-0  text1 sentence1.  othermetadata1
    1  textID1-1   text1 sentence2  othermetadata1
    2  textID2-0   text2 sentence1  othermetadata2
    3  textID3-0  text3 sentence1.  othermetadata3
    4  textID3-1  text3 sentence2.  othermetadata3
    5  textID3-2   text3 sentence3  othermetadata3
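
    The question also mentions spaCy. If you prefer spaCy for sentence splitting, only the tokenization step changes; the explode()/cumcount() logic above stays the same. A minimal sketch, assuming the en_core_web_sm model has been downloaded:

    import spacy
    
    # assumes: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    df['text'] = df['text'].apply(lambda t: [sent.text for sent in nlp(t).sents])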