python-3.x nltk tokenize

NLTK splits a sentence containing a quote into two


Here is the code:

import nltk

x = '"What do you mean?" asked Jack, looking down.'
nltk.tokenize.sent_tokenize(x)

Here is the output:

['"What do you mean?"', 'asked Jack, looking down.']

What I would like to get:

['"What do you mean?" asked Jack, looking down.']

I am not sure how to fix this issue. Any help would be appreciated. Thanks!


Solution

  • You are using 'sent_tokenize()', which splits text into sentences (by default with NLTK's pre-trained Punkt model). It treats '?' and '.' as sentence-ending punctuation, so it ends the first sentence after the question mark and closing quote. That is why you get 2 tokens from your string.

    Read about NLTK tokenizers here: https://www.nltk.org/api/nltk.tokenize.html

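    Note that splitting on the comma does not give your expected output for this sentence:

    x.split(',')

    returns ['"What do you mean?" asked Jack', ' looking down.'], which is still two pieces. To get a single sentence, you need to rejoin the pieces that 'sent_tokenize()' returns whenever the second piece continues the sentence after the quotation.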
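
One possible way to get the expected output is to post-process the list that sent_tokenize() returns. The following is a minimal sketch, not an NLTK feature: the helper name merge_quote_continuations is hypothetical, and the rule it applies (rejoin a piece that starts with a lowercase letter when the previous piece ends with a closing quote) is an assumed heuristic that fits the sentence in the question. To keep the example runnable without NLTK, it starts from the output reported in the question.

```python
def merge_quote_continuations(sentences):
    """Rejoin a piece with the previous one when the previous piece ends
    with a closing quote and this piece starts with a lowercase letter.

    Hypothetical helper: the rule is an assumed heuristic, not NLTK behavior.
    """
    closing_quotes = ('"', '”', "'", '’')
    merged = []
    for sentence in sentences:
        previous = merged[-1] if merged else ""
        if previous.endswith(closing_quotes) and sentence[:1].islower():
            # Looks like a continuation such as: "..." asked Jack.
            merged[-1] = previous + " " + sentence
        else:
            merged.append(sentence)
    return merged


# Output of nltk.tokenize.sent_tokenize(x) as reported in the question.
pieces = ['"What do you mean?"', 'asked Jack, looking down.']
result = merge_quote_continuations(pieces)
print(result)
```

With NLTK installed, you would pass nltk.tokenize.sent_tokenize(x) in place of the hard-coded list. The lowercase check is deliberately conservative: it will not merge a continuation that starts with a capital letter, such as '"Why?" Jack asked.', so treat it as a starting point rather than a general fix.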