
Getting the word count of a sentence without punctuation marks (NLTK, Python)


I am trying to get the word count of each sentence with NLTK in Python.

This is the code I wrote:

import nltk

data = "Sample sentence, for checking. Here is an exclamation mark! Here is a question? This isn't an easy-task."

for i in nltk.sent_tokenize(data):
    print(nltk.word_tokenize(i))

This was the output:

['Sample', 'sentence', ',', 'for', 'checking', '.']
['Here', 'is', 'an', 'exclamation', 'mark', '!']
['Here', 'is', 'a', 'question', '?']
['This', 'is', "n't", 'an', 'easy-task', '.']

Is there any way to remove the punctuation marks, keep isn't from being split into two tokens, and split easy-task into two words?

The output I need is something like this:

['Sample', 'sentence', 'for', 'checking']
['Here', 'is', 'an', 'exclamation', 'mark']
['Here', 'is', 'a', 'question']
['This', "isn't", 'an', 'easy', 'task']

I can more or less handle the punctuation marks by filtering them out with a stopword list:

import nltk

data = "Sample sentence, for checking. Here is an exclamation mark! Here is a question? This isn't an easy-task."

stopwords = [',', '.', '?', '!']

for i in nltk.sent_tokenize(data):
    for j in nltk.word_tokenize(i):
        if j not in stopwords:
            print(j, ', ', end="")
    print('\n')

output:

Sample , sentence , for , checking , 

Here , is , an , exclamation , mark , 

Here , is , a , question , 

This , is , n't , an , easy-task , 

But this does not fix isn't and easy-task. Is there a way to do this? Thank you.


Solution

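  • You can use a different tokenizer that covers most of your requirements. TweetTokenizer keeps contractions such as isn't as a single token, and filtering against string.punctuation drops the punctuation tokens. Note that it still leaves easy-task as one token.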

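    import nltk
    import string

    data = "Sample sentence, for checking. Here is an exclamation mark! Here is a question? This isn't an easy-task."

    # TweetTokenizer keeps contractions such as "isn't" as one token
    tokenizer = nltk.TweetTokenizer()

    for i in nltk.sent_tokenize(data):
        # drop tokens that are single punctuation characters such as ',' or '.'
        print([x for x in tokenizer.tokenize(i) if x not in string.punctuation])

    # output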
    ['Sample', 'sentence', 'for', 'checking']
    ['Here', 'is', 'an', 'exclamation', 'mark']
    ['Here', 'is', 'a', 'question']
    ['This', "isn't", 'an', 'easy-task']
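
The output above still has easy-task as a single token. If you also need the hyphenated word split, one option is to skip NLTK and use a plain regular expression. This is only a rough sketch, not an NLTK feature: it assumes simple English text with straight apostrophes, treats any '.', '!' or '?' followed by whitespace as a sentence boundary, and ignores digits. The names SENTENCE_END, WORD and words_per_sentence are made up for this example.

```python
import re

data = "Sample sentence, for checking. Here is an exclamation mark! Here is a question? This isn't an easy-task."

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

# A word is a run of letters, optionally with internal apostrophes ("isn't").
# Hyphens are not part of the pattern, so "easy-task" yields two matches.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def words_per_sentence(text):
    """Return one list of words for each sentence in text."""
    sentences = SENTENCE_END.split(text.strip())
    return [WORD.findall(s) for s in sentences if s]


for words in words_per_sentence(data):
    # len(words) is the word count of the sentence
    print(len(words), words)
```

Because punctuation never matches the word pattern, no separate filtering step is needed. The trade-off is that abbreviations such as "e.g. this" would be cut into two sentences, which nltk.sent_tokenize handles better.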