
How to properly tokenize a column in pandas?


I am trying to solve a tokenization problem in my dataset of social media comments. I want to tokenize, lemmatize, and remove punctuation and stop-words from a pandas column, but I am struggling with how to do this for each comment. I receive the following error when trying to get tokens:

import pandas as pd
import nltk
...
merged['message_tokens'] = merged.apply(lambda x: nltk.tokenize.word_tokenize(x['Clean_message']), axis=1)

TypeError: expected string or bytes-like object

When I try to tell pandas that I am passing it a string object, it gives me the following error message:

merged['message_tokens'] = merged.apply(lambda x: nltk.tokenize.word_tokenize(x['Clean_message'].str), axis=1)

AttributeError: 'str' object has no attribute 'str'

What am I doing wrong?


Solution

  • The TypeError means some rows hold non-string values (for example NaN from missing data), and the AttributeError comes from the second attempt: inside apply, x['Clean_message'] is a scalar value, not a Series, so it has no .str accessor. You can use astype to force the column type to string:

    merged['Clean_message'] = merged['Clean_message'].astype(str)
    

    If you want to look at what's wrong in the original column, you can use:

    m = merged['Clean_message'].apply(type).ne(str)
    out = merged[m]
    

    The out dataframe contains the rows where the type of the Clean_message column is not string. Once the column has been cast to string, the full cleaning pipeline from the question can be applied per row, as sketched below.
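
    Since the question also asks about lemmatizing and removing punctuation and stop-words, here is a minimal sketch of the whole pipeline. It assumes the NLTK punkt, wordnet, and stopwords resources have been downloaded, and it reuses the merged dataframe and Clean_message column names from the question; the clean_tokens helper is a hypothetical name introduced for illustration.

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    # One-time downloads of the resources this sketch relies on:
    # nltk.download('punkt')
    # nltk.download('wordnet')
    # nltk.download('stopwords')

    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))

    def clean_tokens(text):
        # Lowercase so tokens match the lowercase stop-word list,
        # tokenize, keep alphabetic tokens only (drops punctuation),
        # lemmatize, and filter out stop-words
        tokens = nltk.tokenize.word_tokenize(text.lower())
        return [lemmatizer.lemmatize(t) for t in tokens
                if t.isalpha() and t not in stop_words]

    merged['Clean_message'] = merged['Clean_message'].astype(str)
    merged['message_tokens'] = merged['Clean_message'].apply(clean_tokens)

    Note that applying to the single column with Series.apply avoids the row-wise DataFrame.apply(..., axis=1) from the question, which is slower and unnecessary when only one column is involved.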