I'm still new to Python and want to know how I can tokenize a list of strings without every word being separated by a comma.
For example, starting from a list like ['I have to get groceries.','I need some bananas.','Anything else?'], I want to obtain a list like this: ['I have to get groceries .', 'I need some bananas .', 'Anything else ?']. The point is thus not to create a list with separate tokens necessarily, but to create a list with sentences in which all words and punctuation marks are separated from each other.
Any ideas? I only managed to create a list of comma-separated tokens, using this code:
import nltk
nltk.download('punkt')
from nltk import word_tokenize
tokenized = []
for line in unique:
    tokenized.append(word_tokenize(line))
You can join the tokens of each line back together with a space:
from nltk import word_tokenize
unique = ['I have to get groceries.','I need some bananas.','Anything else?']
tokenized = [" ".join(word_tokenize(line)) for line in unique]
print(tokenized)
# => ['I have to get groceries .', 'I need some bananas .', 'Anything else ?']