I am trying to tokenize my texts with the nltk.word_tokenize() function, but it splits words connected by "_".
For example, the text "A,_B_C! is a movie!" would be split into:
['a', ',', '_b_c', '!', 'is', 'a', 'movie', '!']
The result I want is:
['a,_b_c!', 'is', 'a', 'movie', '!']
My code:
import nltk
text = "A,_B_C! is a movie!"
nltk.word_tokenize(text.lower())
Any help would be appreciated!
You can first split the text on whitespace and then run word_tokenize on each word to handle the punctuation:
[word for sublist in [word_tokenize(x) if '_' not in x else [x]
                      for x in text.lower().split()] for word in sublist]
Output
['a,_b_c!', 'is', 'a', 'movie', '!']
l = [word_tokenize(x) if '_' not in x else [x] for x in text.lower().split()]
returns a list of lists, running word_tokenize only on the words that don't contain _.
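For the example sentence, l evaluates to:

l = [['a,_b_c!'], ['is'], ['a'], ['movie', '!']]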
[word for sublist in l for word in sublist]
flattens that list of lists into a single list.
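Putting it together, a minimal runnable sketch (assuming the punkt tokenizer models are already installed via nltk.download('punkt')):

from nltk.tokenize import word_tokenize

text = "A,_B_C! is a movie!"

# Tokenize each whitespace-separated chunk, but leave chunks
# containing "_" intact so they are not split on punctuation.
l = [word_tokenize(x) if '_' not in x else [x] for x in text.lower().split()]

# Flatten the list of lists into a single token list.
tokens = [word for sublist in l for word in sublist]

print(tokens)  # ['a,_b_c!', 'is', 'a', 'movie', '!']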