Tags: python, text, nlp, nltk, tokenize

Creating tokens from a list of sentences returns characters instead of words


import itertools

from nltk.tokenize import sent_tokenize

text = open(path).read().lower().decode("utf8")
sent_tokenize_list = sent_tokenize(text)

tokens = [w for w in itertools.chain(*[sent for sent in sent_tokenize_list])]

The last line, "tokens", returns characters instead of words.

Why is this, and how do I get it to return words instead, given that I am starting from a list of sentences?


Solution

  • Because sent_tokenize returns a list of sentence strings, and itertools.chain chains iterables into a single iterable, yielding items one at a time from each until they are exhausted. A string is itself an iterable of characters, so in effect you've recombined the sentences into a single string and iterated over it character by character in the list comprehension.
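
To see this behaviour in isolation, here is a minimal sketch using only the standard library. The `sentences` list is made-up sample data standing in for the output of sent_tokenize, not the asker's actual text.

```python
import itertools

# Hypothetical sample data standing in for the output of sent_tokenize.
sentences = ["Hi there.", "Bye now."]

# Each sentence is a string, and iterating over a string yields its
# characters, so chaining the sentences yields one character at a time.
chars = list(itertools.chain(*sentences))
print(chars[:4])  # ['H', 'i', ' ', 't']

# Splitting each sentence first makes every chained iterable a list of words.
words = list(itertools.chain.from_iterable(s.split() for s in sentences))
print(words)  # ['Hi', 'there.', 'Bye', 'now.']
```

Note that the punctuation stays attached to the words here, because str.split only splits on whitespace.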

    To create a single list of words from a list of sentences you can for example split and flatten:

    tokens = [word for sent in sent_tokenize_list for word in sent.split()]
    

    This does not handle punctuation, but your original attempt wouldn't have either. Your original would also work once you add split:

    tokens = [w for w in itertools.chain(*(sent.split()
                                           for sent in sent_tokenize_list))]
    

    Note that you can unpack a generator expression instead of a list comprehension as the arguments. Even better, use chain.from_iterable, which takes the iterable of iterables directly and avoids unpacking altogether:

    tokens = [w for w in itertools.chain.from_iterable(
        sent.split() for sent in sent_tokenize_list)]
    

    For punctuation handling, use nltk.tokenize.word_tokenize instead of str.split. It returns words and punctuation as separate items and splits, for example, I's into I and 's (which is a good thing, since they are in fact separate words, just contracted).
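
As a rough sketch of that last suggestion, the flattening comprehension can call word_tokenize on each sentence. The `sentences` list is again made-up sample data. Passing `preserve_line=True` is my own choice here, assuming a reasonably recent NLTK release that has this parameter: the input is already split into sentences, so word_tokenize does not need to run sentence splitting again.

```python
from nltk.tokenize import word_tokenize

# Hypothetical sample data standing in for the output of sent_tokenize.
sentences = ["Hello, world!", "It's fine."]

# Tokenize each sentence into words and punctuation, then flatten.
# preserve_line=True (an assumption about the installed NLTK version)
# skips the internal sentence-splitting step, since each item is
# already a single sentence.
tokens = [word
          for sent in sentences
          for word in word_tokenize(sent, preserve_line=True)]
print(tokens)  # ['Hello', ',', 'world', '!', 'It', "'s", 'fine', '.']
```

Compared with str.split, the commas, exclamation mark and final period become separate tokens, and the contraction It's is split into It and 's.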