
How to avoid tokenize words with underscore?


I am trying to tokenize my texts with the `nltk.word_tokenize()` function, but it splits words connected by "_".

For example, the text "A,_B_C! is a movie!" would be split into:

['a', ',', '_b_c', '!', 'is', 'a', 'movie', '!']

The result I want is:

['a,_b_c!', 'is', 'a', 'movie', '!']

My code:

import nltk

text = "A,_B_C! is a movie!"
nltk.word_tokenize(text.lower())

Any help would be appreciated!


Solution

  • You can first split the text on spaces and then run word_tokenize on each chunk to handle the punctuation

    from nltk.tokenize import word_tokenize

    [word for sublist in [word_tokenize(x) if '_' not in x else [x]
                          for x in text.lower().split()] for word in sublist]


    Output: ['a,_b_c!', 'is', 'a', 'movie', '!']

    l = [word_tokenize(x) if '_' not in x else [x] for x in text.lower().split()] returns a list of lists, running word_tokenize only on the chunks that don't contain _.

    The outer comprehension [word for sublist in l for word in sublist] then flattens that list of lists into a single flat list of tokens.
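The split-tokenize-flatten recipe above can be wrapped in a small helper. A minimal sketch follows; the function names `tokenize_keep_underscores` and `simple_tokenize` are made up for illustration, and `simple_tokenize` is only a crude regex stand-in for `nltk.word_tokenize` (so the snippet runs without NLTK's punkt data) — in practice you would pass `nltk.word_tokenize` as the tokenizer:

```python
import re

def simple_tokenize(s):
    # Crude stand-in for nltk.word_tokenize: word runs or single
    # punctuation characters (note: \w already includes "_").
    return re.findall(r"\w+|[^\w\s]", s)

def tokenize_keep_underscores(text, tokenizer=simple_tokenize):
    # Split on whitespace first; tokenize only the chunks that
    # contain no "_", keeping underscore-joined chunks intact.
    return [tok for chunk in text.lower().split()
            for tok in (tokenizer(chunk) if '_' not in chunk else [chunk])]

print(tokenize_keep_underscores("A,_B_C! is a movie!"))
# ['a,_b_c!', 'is', 'a', 'movie', '!']
```

Passing the tokenizer as a parameter keeps the underscore-skipping logic separate from the choice of tokenizer, so swapping in word_tokenize later is a one-line change.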