python nltk tokenize

nltk: word_tokenize changes quotes


I'm using Python's nltk and I want to tokenize a sentence containing quotes, but word_tokenize turns each " into `` or ''.

E.g.:

>>> from nltk import word_tokenize

>>> sentence = 'He said "hey Bill!"'
>>> word_tokenize(sentence)
['He', 'said', '``', 'hey', 'Bill', '!', "''"]

Why doesn't it keep the quotes like in the original sentence and how can this be solved?

Thanks


Solution

  • It's actually meant to do that; it isn't an accident. From the Penn Treebank tokenization rules:

    double quotes (") are changed to doubled single forward- and backward- quotes (`` and '')

    Previous versions didn't do that, but the tokenizer was updated last year. In other words, if you want to change this behavior, you'll need to edit treebank.py.
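
If editing treebank.py isn't appealing, one possible workaround is to map the Treebank quote tokens back to a plain double quote after tokenizing. This is only a minimal sketch: `restore_quotes` is a hypothetical helper, not part of nltk, and the token list is copied from the question's output so the snippet runs without nltk installed.

```python
def restore_quotes(tokens):
    """Map the Penn Treebank quote tokens (`` and '') back to a plain double quote.

    Hypothetical helper, not an nltk function. It assumes the original text
    contained no literal `` or '' sequences, since those would be rewritten too.
    """
    treebank_quotes = {"``": '"', "''": '"'}
    return [treebank_quotes.get(tok, tok) for tok in tokens]


# Token list copied from the word_tokenize output shown in the question.
tokens = ['He', 'said', '``', 'hey', 'Bill', '!', "''"]
restored = restore_quotes(tokens)
print(restored)  # ['He', 'said', '"', 'hey', 'Bill', '!', '"']
```

In practice you would call it as `restore_quotes(word_tokenize(sentence))`. Note that this loses the distinction between opening and closing quotes, which is usually fine if you just want the original character back.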