How to split text into sentences and create a new dataframe with one sentence per row?

I have a dataframe df with 3 columns containing speech data: filename, president, text.

I split the text data into sentences using:

# Split 'text' column into sentences and create a new 'sentences' column
df['sentences'] = df['text'].apply(lambda x: nltk.sent_tokenize(x))

However, this stores the whole list of sentences for each text in a single row of the 'sentences' column. I want to create a new dataframe named 'data_sentence' that contains the split sentences together with their respective filename and president, with each row containing ONE sentence.

data_sentence = pd.DataFrame(columns=['filename', 'president', 'sentnew'])

# Iterate over each row in the original DataFrame 'df'
for index, row in df.iterrows():
    filename = row['filename']
    president = row['president']
    sentences = row['sentences']
    
    # Create a temporary DataFrame for the sentences of the current row
    temp_df = pd.DataFrame({'filename': [filename] * len(sentences),
                            'president': [president] * len(sentences),
                            'sentnew': sentences})
    
    # Concatenate the temporary DataFrame with 'data_sentence'
    data_sentence = pd.concat([data_sentence, temp_df], ignore_index=True)

# Print the resulting DataFrame 'data_sentence'
print(data_sentence)

This code works but does not assign ONE sentence to ONE row.

Can someone help out?

Solution

It looks like you just need to explode the 'sentences' column:

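# `df.pop('text')` removes the 'text' column from df and returns it
df['sentences'] = df.pop('text').apply(nltk.sent_tokenize)

# `explode` turns each list element into its own row, repeating filename and president
data_sentence = df.explode('sentences')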

Output:

filename	president	sentences
file1.txt	A	How to split text into sentences and create a new dataframe with one sentence per row using NLTK and Pandas?
file1.txt	A	I have a dataframe df that has 3 columns containing speechdata: 'filename', 'president', 'text'.

Input used:

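import nltk
import pandas as pd

# Tokenizer models used by nltk.sent_tokenize; newer NLTK releases may ask for "punkt_tab" instead
nltk.download("punkt")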

df = pd.DataFrame({
    "filename": ["file1.txt"],
    "president": ["A"],
    "text": [
        "How to split text into sentences and create a new dataframe "
        "with one sentence per row using NLTK and Pandas? "
        "I have a dataframe df that has 3 columns containing "
        "speechdata: 'filename', 'president', 'text'.",
    ]
})
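
If you want the column layout from the question (filename, president, sentnew) and a fresh 0..n-1 index, note that `explode` repeats the original row index for every sentence, so a `reset_index(drop=True)` afterwards may be useful. Below is a minimal sketch of that variant. To keep it runnable without downloading NLTK data, it uses a small regex splitter, `split_sentences`, which is a hypothetical stand-in and not part of NLTK; in real code you would pass `nltk.sent_tokenize` instead. The two-row sample data is also made up for illustration.

```python
import re

import pandas as pd


def split_sentences(text):
    # Hypothetical stand-in for nltk.sent_tokenize (assumption, not NLTK's algorithm):
    # split on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Made-up sample data with the same three columns as in the question
df = pd.DataFrame({
    "filename": ["file1.txt", "file2.txt"],
    "president": ["A", "B"],
    "text": ["First sentence. Second sentence?", "Only one sentence here."],
})

data_sentence = (
    df.assign(sentnew=df["text"].apply(split_sentences))  # list of sentences per row
      .drop(columns="text")                               # keep only filename, president, sentnew
      .explode("sentnew")                                 # one sentence per row
      .reset_index(drop=True)                             # replace the repeated index with 0..n-1
)

print(data_sentence)
```

Unlike the `df.pop` version, this chain leaves the original df untouched, which can be handy if you still need the full text later.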