import itertools
from nltk.tokenize import sent_tokenize
text = open(path).read().decode("utf8").lower()
sent_tokenize_list = sent_tokenize(text)
tokens = [w for w in itertools.chain(*[sent for sent in sent_tokenize_list])]
The last line, tokens, ends up being a list of characters instead of words.
Why is this, and how do I get it to return words instead, given that I'm starting from a list of sentences?
Because sent_tokenize
returns a list of sentence strings, and itertools.chain
chains iterables into a single iterable, yielding items one at a time from each until all are exhausted. Since iterating over a string yields its characters, you have in effect recombined the sentences into one long string and iterated over its characters in the list comprehension.
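A quick illustration with a couple of made-up sentences:

import itertools
sents = ["Hello there.", "How are you?"]
print(list(itertools.chain(*sents))[:5])
# ['H', 'e', 'l', 'l', 'o'] -- chaining strings yields their characters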
To create a single list of words from a list of sentences, you can for example split each sentence and flatten:
tokens = [word for sent in sent_tokenize_list for word in sent.split()]
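With the same made-up sentences as above, this gives:

print([word for sent in sents for word in sent.split()])
# ['Hello', 'there.', 'How', 'are', 'you?']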
This does not handle punctuation, but neither did your original attempt. Your original version would also work with split:
tokens = [w for w in itertools.chain(*(sent.split()
for sent in sent_tokenize_list))]
Note that you can use a generator expression instead of a list comprehension as the argument to unpack. Even better, use chain.from_iterable:
tokens = [w for w in itertools.chain.from_iterable(
sent.split() for sent in sent_tokenize_list)]
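As a side note, chain.from_iterable is also lazier than chain(*...): unpacking with * has to run the generator expression to completion to build the argument tuple first, while from_iterable consumes the outer iterable one item at a time.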
For punctuation handling, use nltk.tokenize.word_tokenize instead of str.split. It'll return words and punctuation as separate items, and splits, for example, I's into I and 's (which of course is a good thing, since they're in fact separate words, just contracted).
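For example, swapping word_tokenize in for split() in the chain.from_iterable version above (word_tokenize needs NLTK's punkt tokenizer data to have been downloaded):

from nltk.tokenize import word_tokenize
print(word_tokenize("It's a test, isn't it?"))
# ['It', "'s", 'a', 'test', ',', 'is', "n't", 'it', '?']
tokens = [w for w in itertools.chain.from_iterable(
    word_tokenize(sent) for sent in sent_tokenize_list)]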