Does nltk word_tokenize always return tokens in input order?


If I run the following code:

from nltk.tokenize import word_tokenize
text = "God is Great! I won a lottery."
print(word_tokenize(text))

I get this output: ['God', 'is', 'Great', '!', 'I', 'won', 'a', 'lottery', '.']

In this case, the tokens in the list appear in the same order as in the input sentence.

However, are they always guaranteed to be in the same order as in the input?


Solution

  • Yes, they are always in the same order as in the input sentence.

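    The word_tokenize method linked below (PunktLanguageVars.word_tokenize) calls re.findall on a compiled pattern. Note that the top-level nltk.tokenize.word_tokenize function used in the question takes a separate code path: in recent NLTK versions it splits the text into sentences and runs a Treebank-style tokenizer on each one, which inserts spaces via regular-expression substitutions and then calls str.split, so the input order is preserved there as well. The Python documentation for re.findall states the following.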

    Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.

    References:
    https://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktLanguageVars.word_tokenize (search for word_tokenize on this page)
    https://docs.python.org/3/library/re.html (search for findall on this page)
    https://docs.python.org/2/library/re.html (search for findall on this page)
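
The left-to-right guarantee quoted above can be checked with a minimal sketch that uses only the standard library. The pattern here is a simplified stand-in chosen for illustration (an assumption, not the regular expression NLTK uses), and the variable names are likewise illustrative.

```python
import re

text = "God is Great! I won a lottery."

# Simplified stand-in pattern (an assumption, NOT NLTK's actual regex):
# either a run of word characters or a single punctuation character.
pattern = re.compile(r"\w+|[^\w\s]")

# findall scans the string left to right and returns matches in that order.
tokens = pattern.findall(text)
print(tokens)

# finditer exposes the start offset of each match; if matches are reported
# in scan order, these offsets are already sorted.
starts = [m.start() for m in pattern.finditer(text)]
print(starts == sorted(starts))
```

For this input the sketch produces the same token list as shown in the question, although the simplified pattern will not match NLTK's output on harder inputs such as contractions or quotation marks.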