I am stuck processing a large text file.
Scenario: the file has been tokenized into a list of 250,000 tokens.
I want to remove the junk words from it, for which I am using NLTK and a list comprehension.
But even for a list of size 100, the list comprehension takes about 10 seconds:
import time
from nltk.corpus import stopwords, words

strt_time = time.time()
no_junk = [x for x in vocab_temp if x in words.words()]
print(time.time() - strt_time)
# 9.56
So for the complete set it would take hours.
How can I optimize this?
This is because your list comprehension calls words.words() on every iteration. Since the word list doesn't change between comparisons, you can move that call outside the loop:
import nltk
from nltk.corpus import words

nltk.download('words')

vocab_temp = ['hello world'] * 100
keep_words = words.words()  # call once, outside the comprehension
no_junk = [x for x in vocab_temp if x in keep_words]
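A further speedup worth considering (beyond the fix above, but standard Python): membership tests against a list are O(n), while a set gives O(1) average-case lookups, so converting keep_words to a set once makes each `x in keep_words` check roughly constant-time. A minimal sketch, using a small hardcoded word set so it runs without downloading the NLTK corpus (the real code would use `set(words.words())`):

```python
# Hypothetical stand-ins for the real data: a tiny "keep" vocabulary
# and a tiny token list, just to illustrate the set-based filter.
keep_words = {"hello", "world", "example"}  # set -> O(1) membership tests

vocab_temp = ["hello", "junkword", "world", "zzz"]
no_junk = [x for x in vocab_temp if x in keep_words]
print(no_junk)  # ['hello', 'world']
```

For 250,000 tokens against the ~236k-entry NLTK word list, the difference between list and set membership is the difference between hours and seconds.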