I have a pandas DataFrame with one word per row in a column called "Word". Sentences are separated by blank lines in the file, so I read it with skip_blank_lines=False to keep each separator as a row of NaN values.
df = pd.read_csv("Data-June-2023.txt", sep=" ", skip_blank_lines=False)
df.tail(20)
Index  Word    _    _    Tag
0      I       _    _    O
1      am      _    _    O
2      from    _    _    O
3      Madrid  _    _    B-City
4      NaN     NaN  NaN  NaN
5      Alice   _    _    B-Person
6      likes   _    _    O
7      Bob     _    _    B-Person
I would like to create a new column called "Sentence #" that numbers the sentences, using the blank (NaN) rows as boundaries. The first word after each NaN in "Word" should start a new label: Sentence: 1, Sentence: 2, Sentence: 3, and so on. Expected output:
Index  Sentence #   Word    _    _    Tag
0      Sentence: 1  I       _    _    O
1                   am      _    _    O
2                   from    _    _    O
3                   Madrid  _    _    B-City
4                   NaN     NaN  NaN  NaN
5      Sentence: 2  Alice   _    _    B-Person
6                   likes   _    _    O
7                   Bob     _    _    B-Person
8                   NaN     NaN  NaN  NaN
9      Sentence: 3  Alice   _    _    B-Person
Thank you in advance!
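I would use boolean indexing. Build a mask that is True on the first row and on every row that follows a NaN in "Word", then use the mask's cumulative sum as the sentence number: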
m = df['Word'].isna().shift(fill_value=True)
df.loc[m, 'Sentence'] = m.cumsum().astype(str).radd('Sentence: ')
Output:
   Index  Word    _    _    Tag       Sentence
0  0      I       _    _    O         Sentence: 1
1  1      am      _    _    O         NaN
2  2      from    _    _    O         NaN
3  3      Madrid  _    _    B-City    NaN
4  4      NaN     NaN  NaN  NaN       NaN
5  5      Alice   _    _    B-Person  Sentence: 2
6  6      likes   _    _    O         NaN
7  7      Bob     _    _    B-Person  NaN
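For anyone who wants to try this without the original file, here is a minimal self-contained sketch of the same idea. The small DataFrame is a hand-built stand-in for the question's data (only "Word" and "Tag" are kept), and the column name "Sentence # (filled)" is my own invention for an optional variant that labels every word row rather than just the first word of each sentence. It assumes sentences are separated by exactly one NaN row; with several consecutive blank rows, a blank row itself would be marked as a sentence start.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the question's data (assumption: one NaN row per sentence break)
df = pd.DataFrame({
    "Word": ["I", "am", "from", "Madrid", np.nan, "Alice", "likes", "Bob"],
    "Tag": ["O", "O", "O", "B-City", np.nan, "B-Person", "O", "B-Person"],
})

# True on the row right after each NaN row; fill_value=True also marks row 0
starts = df["Word"].isna().shift(fill_value=True)

# Running count of sentence starts, i.e. the sentence number of every row
sentence_id = starts.cumsum()
labels = "Sentence: " + sentence_id.astype(str)

# Label only the first word of each sentence, as in the question
df.loc[starts, "Sentence #"] = labels

# Optional variant (hypothetical column name): label every word row,
# leaving the NaN separator rows unlabeled
df["Sentence # (filled)"] = labels.where(df["Word"].notna())

print(df)
```

The filled variant can be handy if you later want to do df.groupby("Sentence # (filled)") to collect the words of each sentence.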