python textblob

Bag-of-words approach to split a message into its individual words


I am trying to split each message into its individual words, i.e. to tokenize the messages.

def split_into_tokens(message):
    message = unicode(message, 'utf8')  # convert bytes into proper unicode
    return TextBlob(message).words

messages.message.head().apply(split_into_tokens)

It shows NameError: name 'unicode' is not defined:

<ipython-input-16-98e123c365b4> in <module>()
----> 1 messages.title.head().apply(split_into_tokens)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   3192             else:
   3193                 values = self.astype(object).values
-> 3194                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   3195
   3196         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()

<ipython-input-14-281c1d080655> in split_into_tokens(title)
      1 def split_into_tokens(title):
----> 2     title = unicode(title, utf8)  # convert bytes into proper unicode
      3     return TextBlob(title).words

NameError: name 'unicode' is not defined

At the end it shows that unicode is not defined. I tried changing the Python version, but the issue remains the same. Do I need to replace unicode with str in the Python plugin directory?


Solution

  • I assume you're on Python 3 (which has no unicode builtin, hence the NameError), so just try deleting the line message = unicode(message, 'utf8'): your message variable is probably a unicode string (str) already. If it's not, then it's probably a bytes object, in which case the right way to turn it into a unicode string under Python 3 is message.decode('utf8'). See https://docs.python.org/3/howto/unicode.html if you want more info.
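
A minimal sketch of that advice, assuming Python 3 and that a message may arrive as either bytes or str. The helper name ensure_text is made up for illustration; only message.decode('utf8') and TextBlob(message).words come from the thread. Running split_into_tokens also assumes textblob and its tokenizer data are installed.

```python
def ensure_text(message, encoding="utf8"):
    """Return message as a Python 3 str, decoding it only if it is bytes."""
    if isinstance(message, bytes):
        return message.decode(encoding)
    return message


def split_into_tokens(message):
    # Imported here so ensure_text can be tried without textblob installed.
    from textblob import TextBlob

    return TextBlob(ensure_text(message)).words


print(ensure_text(b"caf\xc3\xa9 open"))  # bytes input is decoded to str
print(ensure_text("already text"))       # str input passes through unchanged
```

With this in place, the original call messages.message.head().apply(split_into_tokens) should work whether the column holds str or bytes values.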