
"Expected string or bytes-like object" error


 from nltk import word_tokenize, sent_tokenize
 text = data.loc[:, "text"]
 tokenizer = word_tokenize(text)
 print(tokenizer)

I am trying to run a word tokenizer on a specific column of a dataset. I have sliced out the column and passed it to word_tokenize, but when I try to print the tokens I get "expected string or bytes-like object".


Solution

  • Let's assume this DataFrame:

    import pandas as pd

    data = pd.DataFrame({'text': ['some thing', 'word', 'some more text']})
    

    Then, when you run your script, you get an error because you are passing a Series, not a string:

    text = data.loc[:, "text"]
    tokenizer = word_tokenize(text)
    print(tokenizer)
    

    TypeError: expected string or bytes-like object

    word_tokenize accepts a single string, which is why word_tokenize('some text') works. So you need to iterate over your Series:

    text = data.loc[:, "text"]
    tokenizer = [word_tokenize(text[i]) for i in range(len(text))]
    print(tokenizer)
    
    [['some', 'thing'], ['word'], ['some', 'more', 'text']]
    

    If you still get a TypeError then, most likely, not every value in data['text'] is a string. Let's assume this DataFrame now:

    data = pd.DataFrame({'text':['some thing', 'word', 'some more text', 1]})
    

    Performing the list comprehension on this DataFrame will not work, because you end up passing an int to word_tokenize.

    However, if you cast everything to a string first, it works:

    data = pd.DataFrame({'text':['some thing', 'word', 'some more text', 1]})
    data['text'] = data['text'].astype(str)
    
    text = data.loc[:, "text"]
    tokenizer = [word_tokenize(text[i]) for i in range(len(text))]
    print(tokenizer)
    
    [['some', 'thing'], ['word'], ['some', 'more', 'text'], ['1']]
    

    You can check your types with print([type(text[i]) for i in range(len(text))]).
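Expanding that check into a runnable sketch (using the same toy data as above), showing the per-row types before and after the astype(str) fix:

```python
import pandas as pd

data = pd.DataFrame({'text': ['some thing', 'word', 'some more text', 1]})
text = data.loc[:, 'text']

# one type name per row; the trailing 1 shows up as an int
before = [type(v).__name__ for v in text]
print(before)

# cast the whole column to str, then check again
data['text'] = data['text'].astype(str)
after = [type(v).__name__ for v in data['text']]
print(after)
```

Running this prints ['str', 'str', 'str', 'int'] first, which pinpoints exactly which rows would crash word_tokenize, and all 'str' after the cast.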