import itertools
from nltk.tokenize import sent_tokenize
text = open(path).read().decode("utf8").lower()
sent_tokenize_list = sent_tokenize(text)
tokens = [w for w in itertools.chain(*[sent for sent in sent_tokenize_list])]
The last line, tokens, ends up being a list of characters instead of words.
Why is this, and how do I get it to return words instead, given that I'm starting from a list of sentences?
Because sent_tokenize
returns a list of sentence strings, and itertools.chain
chains iterables into a single iterable, yielding items one at a time from each until all are exhausted. Since iterating over a string yields its characters, you have in effect recombined the sentences into one long string and iterated over its characters in the list comprehension.
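A quick illustration with a couple of made-up sentences:

import itertools
sents = ["Hello there.", "How are you?"]
print(list(itertools.chain(*sents))[:5])
# ['H', 'e', 'l', 'l', 'o'] -- chaining strings yields their characters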
To create a single list of words from a list of sentences, you can for example split each sentence and flatten:
tokens = [word for sent in sent_tokenize_list for word in sent.split()]
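With the same made-up sentences as above, this gives:

print([word for sent in sents for word in sent.split()])
# ['Hello', 'there.', 'How', 'are', 'you?']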
This does not handle punctuation, but neither did your original attempt. Your original version would also work with split:
tokens = [w for w in itertools.chain(*(sent.split()
for sent in sent_tokenize_list))]
Note that you can use a generator expression instead of a list comprehension as the argument to unpack. Even better, use chain.from_iterable:
tokens = [w for w in itertools.chain.from_iterable(
sent.split() for sent in sent_tokenize_list)]
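As a side note, chain.from_iterable is also lazier than chain(*...): unpacking with * has to run the generator expression to completion to build the argument tuple first, while from_iterable consumes the outer iterable one item at a time.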
For punctuation handling, use nltk.tokenize.word_tokenize instead of str.split. It'll return words and punctuation as separate items, and splits, for example, I's into I and 's (which of course is a good thing, since they're in fact separate words, just contracted).
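For example, swapping word_tokenize in for split() in the chain.from_iterable version above (word_tokenize needs NLTK's punkt tokenizer data to have been downloaded):

from nltk.tokenize import word_tokenize
print(word_tokenize("It's a test, isn't it?"))
# ['It', "'s", 'a', 'test', ',', 'is', "n't", 'it', '?']
tokens = [w for w in itertools.chain.from_iterable(
    word_tokenize(sent) for sent in sent_tokenize_list)]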