python · nlp · tokenize

Tokenize list of strings without comma separation


I'm still new to Python and want to know how I can tokenize a list of strings without every word being separated by a comma.

For example, starting from a list like ['I have to get groceries.','I need some bananas.','Anything else?'], I want to obtain a list like this: ['I have to get groceries .', 'I need some bananas .', 'Anything else ?']. The point is thus not to create a list with separate tokens necessarily, but to create a list with sentences in which all words and punctuation marks are separated from each other.

Any ideas? I only managed to create a list of comma-separated tokens, using this code:

import nltk
nltk.download('punkt')
from nltk import word_tokenize 
tokenized = []
for line in unique:
    tokenized.append(word_tokenize(line))  # gives a list of token lists, not joined sentences

Solution

  • You can join the tokenized lines with a space; just use:

    from nltk import word_tokenize
    unique = ['I have to get groceries.','I need some bananas.','Anything else?']
    tokenized = [" ".join(word_tokenize(line)) for line in unique]
    print(tokenized)
    # => ['I have to get groceries .', 'I need some bananas .', 'Anything else ?']
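
  • Alternatively, if you already have the nested list of token lists from the loop in the question, a minimal sketch of the same idea is to join each sublist afterwards (this assumes the `unique` and `tokenized` names from above):

    from nltk import word_tokenize
    unique = ['I have to get groceries.','I need some bananas.','Anything else?']
    # Nested list of tokens, as produced by the loop in the question
    tokenized = [word_tokenize(line) for line in unique]
    # Join each token list back into one space-separated sentence
    joined = [" ".join(tokens) for tokens in tokenized]
    print(joined)
    # => ['I have to get groceries .', 'I need some bananas .', 'Anything else ?']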