Tags: python, pandas, nlp

How to use a column of tokenized sentences to further tokenize into words


I have tokenized the text in a column into a new column 'token_sentences' of sentence tokens. I now want to use the 'token_sentences' column to create a new column 'token_words' containing the tokenized words.

The df I am using:

article_id      article_text                                       
1           Maria Sharapova has basically no friends as te...   
2           Roger Federer advance...    
3           Roger Federer has revealed that organisers of ...   
4           Kei Nishikori will try to end his long losing ...

After adding the token_sentences column:

article_id      article_text                                      token_sentences                          
1           Maria Sharapova has basically no friends as te...    [Maria Sharapova has basically no friends as te    
2           Roger Federer advance...                             [Roger Federer advance...
3           Roger Federer has revealed that organisers of ...    [Roger Federer has revealed that organisers of...
4           Kei Nishikori will try to end his long losing ...    [Kei Nishikori will try to end his long losing...

Each row now holds a list of sentences. I am unable to flatten the list in the token_sentences column so that it can be used in the next step.

I want to use the token_sentences column to make the df look like:

article_id  article_text    token_sentences                         token_words                       
1           Maria...        ["Maria Sharapova..",["..."]]           [Maria, Sharapova, has, basically, no, friends,...]       
2           Roger...        ["Roger Federer advanced  ...",["..."]] [Roger,Federer,...]
3           Roger...        ["Roger Federer...",["..."]]            [Roger ,Federer,...]
4           Kei ...         ["Kei Nishikori will try...",["..."]]   [Kei,Nishikori,will,try,...]
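To make the desired transformation concrete, here is a minimal, self-contained sketch with pandas. A plain str.split() is used as a stand-in word tokenizer so the snippet runs without NLTK; the toy data and the split are assumptions for illustration, not the question's real frame:

```python
import pandas as pd

# Toy frame mimicking the question's df: token_sentences holds a list
# of sentences per row.
df = pd.DataFrame({
    "article_id": [1, 2],
    "token_sentences": [
        ["Maria Sharapova has basically no friends.", "She says so."],
        ["Roger Federer advanced."],
    ],
})

# Flatten each row's list of sentences into one list of words:
# tokenize every sentence, then chain the per-sentence token lists.
df["token_words"] = df["token_sentences"].apply(
    lambda sents: [word for sent in sents for word in sent.split()]
)

print(df["token_words"].iloc[1])  # ['Roger', 'Federer', 'advanced.']
```

The nested list comprehension does the flattening that the question asks about: the outer loop walks the sentences in a row, the inner loop walks the words of each sentence.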


Solution

  • from nltk.tokenize import word_tokenize

    # Join each row's list of sentences back into a single string
    def join(token_sentences):
        return " ".join(token_sentences)

    # Re-join the sentences, then tokenize the full text into words
    new_df = df['token_sentences'].apply(join).apply(word_tokenize)
    

    new_df now holds your word tokens. The join function is there to join each row's list of sentences into one string first (I didn't realise the sentences were lists too); if your sentences are not lists, just remove the .apply(join) step. Then add the result back to your df:

    df['token_words'] = new_df
    
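If you prefer not to re-join the sentences into one string, an equivalent approach is to tokenize each sentence separately and flatten the per-sentence token lists with itertools.chain. This sketch uses str.split() as a stand-in tokenizer so it runs without NLTK; with NLTK installed you would pass nltk.word_tokenize instead:

```python
from itertools import chain

# Stand-in tokenizer; swap in nltk.word_tokenize for real use
def tokenize(sentence):
    return sentence.split()

# One row's list of sentences, as in the token_sentences column
sentences = ["Kei Nishikori will try.", "He lost before."]

# Tokenize each sentence, then flatten the lists of tokens into one list
words = list(chain.from_iterable(tokenize(s) for s in sentences))
print(words)  # ['Kei', 'Nishikori', 'will', 'try.', 'He', 'lost', 'before.']
```

Skipping the join avoids building an intermediate full-text string per row, which can matter for long articles.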

    Install nltk first (word_tokenize also needs the punkt tokenizer data):

    pip install nltk
    python -c "import nltk; nltk.download('punkt')"