python pandas nltk text-processing

How to split text into sentences and create a new dataframe with one sentence per row?


I have a dataframe df with three columns of speech data: filename, president, and text.

I split the text data into sentences using:

import nltk

# Split the 'text' column into sentences; each cell of 'sentences' holds a list
df['sentences'] = df['text'].apply(nltk.sent_tokenize)

However, this only tokenizes the text: each row of the 'sentences' column still holds the full list of sentences for that speech. I want a new dataframe named 'data_sentence' in which each row contains ONE sentence together with its filename and president.

data_sentence = pd.DataFrame(columns=['filename', 'president', 'sentnew'])

# Iterate over each row in the original DataFrame 'df'
for index, row in df.iterrows():
    filename = row['filename']
    president = row['president']
    sentences = row['sentences']
    
    # Create a temporary DataFrame for the sentences of the current row
    temp_df = pd.DataFrame({'filename': [filename] * len(sentences),
                            'president': [president] * len(sentences),
                            'sentnew': sentences})
    
    # Concatenate the temporary DataFrame with 'data_sentence'
    data_sentence = pd.concat([data_sentence, temp_df], ignore_index=True)

# Print the resulting DataFrame 'data_sentence'
print(data_sentence)

This code runs, but it does not put ONE sentence in each row.

Can someone help?


Solution

  • Looks like you just need to explode the sentences:

    # `df.pop` removes the 'text' column from df and returns it
    df['sentences'] = df.pop('text').apply(nltk.sent_tokenize)
    
    # `explode` turns each list element into its own row
    data_sentence = df.explode('sentences')
    

    Output:

    filename president sentences
    file1.txt A How to split text into sentences and create a new dataframe with one sentence per row using NLTK and Pandas?
    file1.txt A I have a dataframe df that has 3 columns containing speechdata: 'filename', 'president', 'text'.

    Input used:

    import nltk
    import pandas as pd

    nltk.download("punkt")  # tokenizer models needed by nltk.sent_tokenize
    
    df = pd.DataFrame({
        "filename": ["file1.txt"],
        "president": ["A"],
        "text": [
            "How to split text into sentences and create a new dataframe "
            "with one sentence per row using NLTK and Pandas? "
            "I have a dataframe df that has 3 columns containing "
            "speechdata: 'filename', 'president', 'text'.",
        ]
    })
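
One thing to be aware of with `explode`: the new rows keep the index of the row they came from, so the index contains repeats. If you want a clean 0..n-1 index and the column name `sentnew` from the question, something like the sketch below might help. Note that `split_sentences` is a hypothetical regex stand-in for `nltk.sent_tokenize`, used here only so the example runs without downloading the punkt data; the sample rows are made up for illustration.

```python
import re

import pandas as pd


def split_sentences(text):
    # Hypothetical stand-in for nltk.sent_tokenize (an assumption, not NLTK's logic):
    # split on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Made-up sample data with the same three columns as in the question
df = pd.DataFrame({
    "filename": ["file1.txt", "file2.txt"],
    "president": ["A", "B"],
    "text": ["First sentence. Second sentence!", "Only one sentence here."],
})

data_sentence = (
    df.assign(sentnew=df["text"].apply(split_sentences))  # list of sentences per row
      .drop(columns="text")                               # drop the full text
      .explode("sentnew")                                 # one sentence per row
      .reset_index(drop=True)                             # replace the repeated index
)

print(data_sentence)
```

With real data you would swap `split_sentences` for `nltk.sent_tokenize`; the `assign` / `explode` / `reset_index` chain stays the same, and the original `df` is left unmodified because nothing is popped from it.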