I have a dataframe df that has 3 columns containing speech data: filename, president, text.
I split the text data into sentences using:
# Split 'text' column into sentences and create a new 'sentences' column
df['sentences'] = df['text'].apply(lambda x: nltk.sent_tokenize(x))
However, this leaves all the sentences of a speech together as a list in a single row of the 'sentences' column. I want to create a new dataframe named 'data_sentence' that contains the split sentences with their respective filename and president, with each row containing ONE sentence.
data_sentence = pd.DataFrame(columns=['filename', 'president', 'sentnew'])
# Iterate over each row in the original DataFrame 'df'
for index, row in df.iterrows():
    filename = row['filename']
    president = row['president']
    sentences = row['sentences']
    # Create a temporary DataFrame for the sentences of the current row
    temp_df = pd.DataFrame({'filename': [filename] * len(sentences),
                            'president': [president] * len(sentences),
                            'sentnew': sentences})
    # Concatenate the temporary DataFrame with 'data_sentence'
    data_sentence = pd.concat([data_sentence, temp_df], ignore_index=True)
# Print the resulting DataFrame 'data_sentence'
print(data_sentence)
This code runs, but it does not assign ONE sentence to ONE row.
Can someone help out?
Looks like you just need to explode the sentences:
df['sentences'] = df.pop('text').apply(lambda x: nltk.sent_tokenize(x)) # use `df.pop`
data_sentence = df.explode('sentences') # <-- add this line
Output:
| filename | president | sentences |
|---|---|---|
| file1.txt | A | How to split text into sentences and create a new dataframe with one sentence per row using NLTK and Pandas? |
| file1.txt | A | I have a dataframe df that has 3 columns containing speechdata: 'filename', 'president', 'text'. |
Input used:
import nltk
import pandas as pd
nltk.download("punkt")
df = pd.DataFrame({
    "filename": ["file1.txt"],
    "president": ["A"],
    "text": [
        "How to split text into sentences and create a new dataframe "
        "with one sentence per row using NLTK and Pandas? "
        "I have a dataframe df that has 3 columns containing "
        "speechdata: 'filename', 'president', 'text'.",
    ]
})
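
If you want the result under the column name 'sentnew' from the question, with a clean 0..n-1 index, something along these lines might work. This is only a sketch: the two-row sample data is made up, and split_sentences is a hypothetical regex stand-in for nltk.sent_tokenize so that the snippet runs without downloading the punkt data. In real use you would pass nltk.sent_tokenize to apply instead.

```python
import re

import pandas as pd

# Made-up sample data, for illustration only
df = pd.DataFrame({
    "filename": ["file1.txt", "file2.txt"],
    "president": ["A", "B"],
    "text": [
        "First sentence. Second sentence!",
        "Only one sentence here.",
    ],
})


def split_sentences(text):
    """Hypothetical stand-in for nltk.sent_tokenize (an assumption, much cruder):
    split on whitespace that follows '.', '!' or '?'."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


data_sentence = (
    df.assign(sentnew=df["text"].apply(split_sentences))  # list of sentences per row
      .drop(columns="text")                               # keep filename and president only
      .explode("sentnew")                                 # one list element per row
      .reset_index(drop=True)                             # explode repeats the original index
)

print(data_sentence)
```

The reset_index(drop=True) step is there because explode keeps the index of the source row for every sentence it produces, so without it all sentences from one speech share the same index label.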