I'm using Python's nltk and I want to tokenize a sentence containing quotes, but it turns " into `` and ''.
E.g.:
>>> from nltk import word_tokenize
>>> sentence = 'He said "hey Bill!"'
>>> word_tokenize(sentence)
['He', 'said', '``', 'hey', 'Bill', '!', "''"]
Why doesn't it keep the quotes like in the original sentence and how can this be solved?
Thanks
It's actually meant to do that; it's not by accident. From the Penn Treebank tokenization rules:
double quotes (") are changed to doubled single forward- and backward- quotes (`` and '')
In previous versions it didn't do that, but it was updated last year. In other words, if you want to change this behavior, you'll need to edit treebank.py.
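If you'd rather not touch the library source, one possible workaround is to post-process the tokens and map the Treebank quote tokens back to a plain double quote. This is only a minimal sketch: `restore_double_quotes` is a hypothetical helper name (not part of nltk), and the token list is hard-coded to match the output shown in the question so the snippet runs without nltk installed.

```python
def restore_double_quotes(tokens):
    """Map the Penn Treebank quote tokens (`` and '') back to a plain ".

    Hypothetical helper, not part of nltk. Assumes the input text did not
    itself contain a literal `` or '' that should be preserved.
    """
    treebank_quotes = {"``", "''"}
    return ['"' if tok in treebank_quotes else tok for tok in tokens]


# Token list copied from the word_tokenize output in the question.
tokens = ['He', 'said', '``', 'hey', 'Bill', '!', "''"]
restored = restore_double_quotes(tokens)
print(restored)
```

In practice you would pass the result of `word_tokenize(sentence)` instead of the hard-coded list. Note that this loses the opening/closing distinction, and it would also rewrite any `` or '' that appeared literally in the original text.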