python pandas dataframe group-by

Pandas dataframe - Separate sentences by NaN values


I have a Pandas data frame with one word per row in a column called "Word". Sentences are separated by blank lines in the file, so I read it with skip_blank_lines=False to keep each blank line as a row of NaN values.

import pandas as pd

df = pd.read_csv("Data-June-2023.txt", sep=" ", skip_blank_lines=False)
df.tail(20)

Index   Word    _   _   Tag

0   I   _   _   O
1   am  _   _   O
2   from    _   _   O
3   Madrid  _   _   B-City
4   NaN   NaN  NaN  NaN
5   Alice   _   _   B-Person
6   likes   _   _   O
7   Bob _   _   B-Person
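For anyone reproducing this without the original file, here is a minimal sketch of how the blank lines end up as NaN rows. The inline sample text and the names raw, df_demo and blank_rows are made up for illustration; only read_csv with sep=" " and skip_blank_lines=False comes from the code above.

```python
import io

import pandas as pd

# Hypothetical sample in the same space-separated layout as the real file.
# The empty line between "am O" and "Alice B-Person" is the sentence break.
raw = "Word Tag\nI O\nam O\n\nAlice B-Person"

# skip_blank_lines=False keeps the empty line as a row instead of dropping it.
df_demo = pd.read_csv(io.StringIO(raw), sep=" ", skip_blank_lines=False)

# The blank line shows up as NaN in every column, including "Word".
blank_rows = df_demo["Word"].isna().tolist()
print(blank_rows)
```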

I would like to create a new column called "Sentence #" that numbers the sentences, using the blank lines (NaN values) as boundaries. The first row, and each row that follows a NaN in "Word", should start a new sentence: Sentence: 1, Sentence: 2, Sentence: 3, etc.

Index   Sentence #  Word    _   _   Tag

0   Sentence: 1 I   _   _   O
1               am  _   _   O
2               from    _   _   O
3               Madrid  _   _   B-City
4               NaN NaN NaN NaN
5   Sentence: 2 Alice   _   _   B-Person
6               likes   _   _   O
7               Bob _   _   B-Person
8               NaN NaN NaN NaN
9   Sentence: 3 Alice   _   _   B-Person

Thank you in advance!


Solution

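  • I would use boolean indexing. Build a mask m that is True on the first row and on every row right after a NaN in "Word", then write the running count of that mask onto those rows only: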

    m = df['Word'].isna().shift(fill_value=True)
    df.loc[m, 'Sentence'] = m.cumsum().astype(str).radd('Sentence: ')
    

    Output:

       Index    Word    _    _       Tag     Sentence
    0      0       I    _    _         O  Sentence: 1
    1      1      am    _    _         O          NaN
    2      2    from    _    _         O          NaN
    3      3  Madrid    _    _    B-City          NaN
    4      4     NaN  NaN  NaN       NaN          NaN
    5      5   Alice    _    _  B-Person  Sentence: 2
    6      6   likes    _    _         O          NaN
    7      7     Bob    _    _  B-Person          NaN
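To try the approach without the original file, here is a self-contained sketch. The small DataFrame is a stand-in built from the rows shown in the question (only the "Word" and "Tag" columns), and sentence_id is a name I added for illustration; the two lines that build m and assign "Sentence" are the same as in the answer.

```python
import pandas as pd

# Stand-in for the question's data; None plays the role of a blank line.
df = pd.DataFrame({
    "Word": ["I", "am", "from", "Madrid", None, "Alice", "likes", "Bob"],
    "Tag": ["O", "O", "O", "B-City", None, "B-Person", "O", "B-Person"],
})

# isna() flags the blank rows; shifting down by one moves each flag onto the
# row that follows it, and fill_value=True marks the very first row as a start.
m = df["Word"].isna().shift(fill_value=True)

# cumsum() counts the sentence starts seen so far; the label is written only
# on rows where the mask is True, so all other rows stay NaN.
df.loc[m, "Sentence"] = m.cumsum().astype(str).radd("Sentence: ")

# The running count on its own gives a sentence number for every row.
sentence_id = m.cumsum()

print(df)
```

Note that sentence_id assigns a number to the blank row too (it counts as the tail of the sentence before it), so filter on df["Word"].notna() first if the number is going to be used as a group key.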