I have a pandas dataframe that looks like this:
textID1, text1, othermetadata1
textID2, text2, othermetadata2
textID3, text3, othermetadata3
I would like to break the texts into sentences in a new data frame that would look like this:
textID1-001, sentence1 (of text1), othermetadata1
textID1-002, sentence2 (of text1), othermetadata1
textID2-001, sentence1 (of text2), othermetadata2
I know how to break texts into sentences using either NLTK or spaCy, e.g.:
sentences = [sent_tokenize(text) for text in texts]
But pandas continues to confound me: how do I take the output and pack it back into a data frame? Moreover, how do I either add numbers to an existing column or create a new column whose numbering restarts with each text? My assumption is that I could then merge the textID and sentenceID columns afterwards.
Once you have your list of sentences for a given row, you can break them into separate rows with explode(). Exploding preserves the index of the original DataFrame, so by grouping on that old index and using cumcount(), you can generate consecutive IDs that restart for each original row.
Here I am assuming your text ID column is called "text_id" and your sentences column is called "sentences".
df = df.explode('sentences').reset_index().rename(columns={'index': 'row_id'})
df['row_id'] = df.groupby('row_id').cumcount()
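To make the mechanics concrete, here is a minimal self-contained sketch of the explode-then-cumcount step, using toy data and the column names assumed above ("text_id", "sentences"):

```python
import pandas as pd

# Toy data: one row per text, sentences already split into a list
df = pd.DataFrame({
    "text_id": ["textID1", "textID2"],
    "sentences": [["s1", "s2"], ["s3"]],
})

# explode() gives one row per sentence but repeats the original index,
# so after reset_index() the old index becomes a "row_id" column
df = df.explode("sentences").reset_index().rename(columns={"index": "row_id"})

# Grouping by the old index makes cumcount() restart at 0 for each text
df["row_id"] = df.groupby("row_id").cumcount()
# row_id is now 0, 1 for textID1's sentences and 0 for textID2's
```

The key point is that explode() duplicates the original row's index, which is exactly the grouping key you need for a per-text counter.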
If you want to combine your text ID column with the row_id, you can just use the following:
df['text_id'] = df['text_id'] + "-" + df["row_id"].astype(str)
Complete solution for the above:
import pandas as pd
import nltk
nltk.download('punkt')  # sentence tokenizer models
from nltk.tokenize import sent_tokenize
df = pd.DataFrame([["textID1", "text1 sentence1. text1 sentence2", "othermetadata1"],
["textID2", "text2 sentence1", "othermetadata2"],
["textID3", "text3 sentence1. text3 sentence2. text3 sentence3", "othermetadata3"]], columns=["text_id", "text", "metadata"])
df['text'] = df['text'].apply(sent_tokenize)
df = df.explode("text").reset_index().rename(columns={'index' : 'row_id'})
df['row_id'] = df.groupby('row_id').cumcount()
df['text_id'] = df['text_id'] + "-" + df["row_id"].astype('str')
df = df.drop(columns=['row_id'])
df
text_id text metadata
0 textID1-0 text1 sentence1. othermetadata1
1 textID1-1 text1 sentence2 othermetadata1
2 textID2-0 text2 sentence1 othermetadata2
3 textID3-0 text3 sentence1. othermetadata3
4 textID3-1 text3 sentence2. othermetadata3
5 textID3-2 text3 sentence3 othermetadata3
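If you want the zero-padded, 1-based IDs from the question (textID1-001, textID1-002, ...) rather than textID1-0, you can shift the counter by one and pad it with str.zfill(). A sketch, using a hypothetical "sent_id" helper column and toy data with pre-split sentences:

```python
import pandas as pd

df = pd.DataFrame([["textID1", ["s1", "s2"], "othermetadata1"],
                   ["textID2", ["s3"], "othermetadata2"]],
                  columns=["text_id", "text", "metadata"])

df = df.explode("text").reset_index().rename(columns={"index": "row_id"})

# Start numbering at 1 and zero-pad to three digits, e.g. "001"
df["sent_id"] = (df.groupby("row_id").cumcount() + 1).astype(str).str.zfill(3)
df["text_id"] = df["text_id"] + "-" + df["sent_id"]
df = df.drop(columns=["row_id", "sent_id"])
# text_id is now textID1-001, textID1-002, textID2-001
```

Three digits of padding handles up to 999 sentences per text; widen the zfill() argument if your texts are longer.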