Tags: python, pandas, nlp, nltk

How to go through each row with pandas apply() and lambda to clean sentence tokens?


My goal is to create a cleaned column of tokenized sentences within the existing dataframe. The dataset is a pandas dataframe looking like this:

    Index   tokenized_sents
    First   [Donald, Trump, just, couldn, t, wish, all, Am]
    Second  [On, Friday, ,, it, was, revealed, that]
This is what I have tried:

    dataset['cleaned_sents'] = dataset.apply(
        lambda row: [w for w in row["tokenized_sents"]
                     if len(w) > 2 and w.lower() not in stop_words],
        axis=1)

My current output is the dataframe without that extra column.

Current output:

    tokenized_sents
0  [Donald, Trump, just, couldn, t, wish, all, Am...  

Wanted output:

  tokenized_sents
0  [Donald, Trump, just, couldn, wish, all...   

Basically, I want to remove all the stopwords and short words.


Solution

  • Create a sentence index

    dataset['gid'] = range(1, dataset.shape[0] + 1)
    
           tokenized_sents  gid
    0  [This, is, a, test]    1
    1    [and, this, too!]    2
    

    Then explode the dataframe (DataFrame.explode is available in pandas 0.25 and later)

    clean_df = dataset.explode('tokenized_sents')
    
      tokenized_sents  gid
    0            This    1
    0              is    1
    0               a    1
    0            test    1
    1             and    2
    1            this    2
    1            too!    2
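
    (Note that explode repeats the original index, as the 0/0/0/0/1/1/1 above shows, so grouping by the index, e.g. clean_df.groupby(level=0), would also work; the explicit gid column just makes the grouping key unambiguous.)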
    

    Do all the cleaning on this exploded dataframe, then use the gid column to group the tokens back together. Vectorized string operations on the exploded frame are generally much faster than a row-wise apply.

    # keep only tokens longer than two characters (len(w) > 2, as in the question)
    clean_df = clean_df[clean_df.tokenized_sents.str.len() > 2]
    .
    .
    .
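
    For instance, the elided cleaning could include stop word removal. A minimal sketch, assuming stop_words holds NLTK's English stop word list:

        from nltk.corpus import stopwords  # assumes the stopwords corpus has been downloaded

        stop_words = set(stopwords.words('english'))
        # after explode, each row holds a single token string,
        # so vectorized .str methods apply directly
        clean_df = clean_df[~clean_df.tokenized_sents.str.lower().isin(stop_words)]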
    

    To group the tokens back into sentences:

    clean_dataset = clean_df.groupby('gid').agg(list)
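
    Putting it all together, here is a minimal end-to-end sketch; the sample data and the NLTK stop word list are illustrative assumptions:

        import pandas as pd
        from nltk.corpus import stopwords  # assumes the stopwords corpus has been downloaded

        stop_words = set(stopwords.words('english'))

        # toy data mirroring the question (illustrative only)
        dataset = pd.DataFrame({'tokenized_sents': [
            ['Donald', 'Trump', 'just', 'couldn', 't', 'wish', 'all', 'Am'],
            ['On', 'Friday', ',', 'it', 'was', 'revealed', 'that'],
        ]})

        dataset['gid'] = range(1, dataset.shape[0] + 1)

        # one token per row
        clean_df = dataset.explode('tokenized_sents')

        # drop short tokens and stop words with vectorized operations
        clean_df = clean_df[clean_df.tokenized_sents.str.len() > 2]
        clean_df = clean_df[~clean_df.tokenized_sents.str.lower().isin(stop_words)]

        # collapse the tokens back into one list per original sentence
        clean_dataset = clean_df.groupby('gid').agg(list)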