python pandas nlp data-cleaning

Collapse a pandas data frame of words into sentences


My goal is to take a dataframe with one row per word (a token and its tag) and collapse it into a dataframe with one row per sentence, holding the sentence text and a list of its tags.

Sample input:

import pandas as pd

df = pd.DataFrame([('Effect', 'O'),
                   ('of', 'O'),
                   ('ginseng', 'i'),
                   ('extract', 'i'),
                   ('supplementation', 'i'),
                   ('on', 'O'),
                   ('testicular', 'o'),
                   ('functions', 'o'),
                   ('in', 'O'),
                   ('diabetic', 'p'),
                   ('rats', 'p'),
                   ('.', 'p'),
                   ('OBJECTIVE', 'O'),
                   ('It', 'O'),
                   ('was', 'O')],
                  columns=('token', 'annotation'))

Goal output:

df = pd.DataFrame([('Effect of ginseng extract supplementation on testicular functions in diabetic rats.',
                    ['O','O','i','i','i','O','o','o','O','p','p','p']),
                   ('OBJECTIVE It was', ['O','O','O'])],
                  columns=('token', 'annotation'))

Sorry for the goofy example - that really is the first 15 rows of this dataset!!

Any ideas on how to compress the rows of words into rows of sentences would be much appreciated.


Solution

  • Use GroupBy.agg:

    new_df = (df.groupby(df['token'].eq('.').shift(fill_value=False).cumsum(),
                         as_index=False)
                .agg({'token': ' '.join, 'annotation': list}))
    print(new_df)
                                                   token  \
    0  Effect of ginseng extract supplementation on t...   
    1                                   OBJECTIVE It was   
    
                                 annotation  
    0  [O, O, i, i, i, O, o, o, O, p, p, p]  
    1                             [O, O, O]
    

    If you don't want to include the final period of each sentence:

    m = df['token'].eq('.')
    # Rows where m is True get no group key, so groupby leaves them out.
    new_df = (df.groupby(m.shift(fill_value=False).cumsum().loc[~m], as_index=False)
                .agg({'token': ' '.join, 'annotation': list}))
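
To make the one-liner easier to follow, here is a minimal sketch that builds the group key step by step on the sample data from the question. The names `is_end`, `sentence_id`, `with_period` and `without_period` are illustrative and not part of the answer above, and the sketch assumes `'.'` is the only sentence terminator in the data.

```python
import pandas as pd

df = pd.DataFrame(
    [('Effect', 'O'), ('of', 'O'), ('ginseng', 'i'), ('extract', 'i'),
     ('supplementation', 'i'), ('on', 'O'), ('testicular', 'o'),
     ('functions', 'o'), ('in', 'O'), ('diabetic', 'p'), ('rats', 'p'),
     ('.', 'p'), ('OBJECTIVE', 'O'), ('It', 'O'), ('was', 'O')],
    columns=('token', 'annotation'))

# True on each row that ends a sentence (assumption: '.' is the only terminator).
is_end = df['token'].eq('.')

# Shift down one row so the period stays with the sentence it closes,
# then count the sentence ends seen so far to get a running sentence id.
sentence_id = is_end.shift(fill_value=False).cumsum().rename('sentence_id')

# Variant 1: keep the period in both the text and the tag list.
with_period = (df.groupby(sentence_id)
                 .agg({'token': ' '.join, 'annotation': list})
                 .reset_index(drop=True))

# Variant 2: filter out the period rows before grouping.
keep = ~is_end
without_period = (df[keep].groupby(sentence_id[keep])
                    .agg({'token': ' '.join, 'annotation': list})
                    .reset_index(drop=True))

print(with_period)
print(without_period)
```

Because the tokens are joined with a single space, the first variant produces `rats .` rather than `rats.` as written in the goal output; closing that gap would need an extra clean-up step on the joined string.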