
Python Pandas NLTK Tokenize Column in Pandas Dataframe: expected string or bytes-like object


I have the following sample data frame with a 'problem_definition' column:

ID  problem_definition  
1   cat, dog fish
2   turtle; cat; fish fish
3   hello book fish 
4   dog hello fish cat

I want to word tokenize the 'problem_definition' column.

Below is my code:

from nltk.tokenize import sent_tokenize, word_tokenize 
import pandas as pd 

df = pd.read_csv('log_page_nlp_subset.csv')

df['problem_definition_tokenized'] = df['problem_definition'].apply(word_tokenize)

The code above gives me the following error:

TypeError: expected string or bytes-like object


Solution

  • There is probably a non-string-like object (such as NaN) in your actual df['problem_definition'] column which is not shown in the data you posted. (The examples below use df['TEXT'] as a stand-in for your column name.)

    Here is how you might be able to find the problematic values:

    # Build a boolean mask that is True where the value is string-like
    mask = df['TEXT'].apply(lambda item: isinstance(item, (str, bytes)))
    print(df.loc[~mask])
    

    If you wish to remove these rows, you could use

    df = df.loc[mask]
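
    Alternatively, if the only offending values are NaN, the standard pandas dropna method does the same job (a sketch, assuming your column is named 'TEXT' as above):

    df = df.dropna(subset=['TEXT'])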
    

    Or, as PineNuts0 points out, the entire column can be coerced to str dtype using

    df['TEXT'] = df['TEXT'].astype(str)
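
    Beware that astype(str) turns NaN into the literal string 'nan', which word_tokenize will then treat as a real token. If you would rather treat missing values as empty text, a minimal alternative sketch uses the standard pandas fillna method:

    df['TEXT'] = df['TEXT'].fillna('')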
    

    For example, if there is a NaN value in df['TEXT']:

    import numpy as np
    import pandas as pd
    from nltk.tokenize import word_tokenize
    
    df = pd.DataFrame({'ID': [1, 2, 3, 4],
                       'TEXT': ['cat, dog fish',
                                'turtle; cat; fish fish',
                                'hello book fish',
                                np.nan]})
    #    ID                    TEXT
    # 0   1           cat, dog fish
    # 1   2  turtle; cat; fish fish
    # 2   3         hello book fish
    # 3   4                     NaN
    
    # df['TEXT'].apply(word_tokenize)
    # TypeError: expected string or bytes-like object
    
    
    mask = df['TEXT'].apply(lambda item: isinstance(item, (str, bytes)))
    df = df.loc[mask]
    #    ID                    TEXT
    # 0   1           cat, dog fish
    # 1   2  turtle; cat; fish fish
    # 2   3         hello book fish
    

    and now applying word_tokenize works:

    In [108]: df['TEXT'].apply(word_tokenize)
    Out[108]: 
    0                [cat, ,, dog, fish]
    1    [turtle, ;, cat, ;, fish, fish]
    2                [hello, book, fish]
    Name: TEXT, dtype: object
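
    Putting it together for the original dataframe, a minimal sketch (the column names come from the question; note that word_tokenize needs NLTK's 'punkt' tokenizer data, which nltk.download('punkt') fetches if it is missing):

    import pandas as pd
    from nltk.tokenize import word_tokenize

    df = pd.read_csv('log_page_nlp_subset.csv')

    # Replace NaN with empty strings so every value is a str, then tokenize
    df['problem_definition_tokenized'] = (
        df['problem_definition'].fillna('').apply(word_tokenize)
    )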