Tags: python-3.x, nlp, nltk, list-comprehension

Remove junk words from a large token list in NLTK


I am stuck processing a large text file.

Scenario: the text file has been tokenized into a list of length 250,000.

I want to remove the junk words from it, for which I am using NLTK and a list comprehension.

But for a list of just 100 tokens, the list comprehension already takes about 10 seconds:

    import time
    from nltk.corpus import stopwords, words

    # vocab_temp is the token list; this timing run uses a 100-token slice
    strt_time = time.time()
    no_junk = [x for x in vocab_temp if x in words.words()]
    print(time.time() - strt_time)
    # 9.56

So for the complete set it would take hours.

How can I optimize this?


Solution

  • This is because your list comprehension calls words.words() on every iteration. Since the word list does not change between comparisons, you can move that call outside the loop:

    import nltk
    from nltk.corpus import words

    nltk.download('words')

    vocab_temp = ['hello', 'world'] * 50  # 100-token sample list

    # Fetch the corpus word list once, outside the comprehension
    keep_words = words.words()

    no_junk = [x for x in vocab_temp if x in keep_words]
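
A further speed-up worth noting (not part of the original answer): words.words() returns a plain list, so each membership test in the comprehension is a linear scan over a couple hundred thousand entries. Converting the list to a set makes each test O(1) on average, which is what makes a 250,000-token pass practical. A minimal sketch, assuming vocab_temp holds your tokens (the sample contents here are placeholders):

    import nltk
    from nltk.corpus import words

    nltk.download('words')

    vocab_temp = ['hello', 'world', 'qwzx'] * 50  # placeholder token list

    # Build the lookup structure once, as a set: hashing gives
    # average O(1) membership tests instead of a linear scan per token.
    keep_words = set(words.words())

    no_junk = [x for x in vocab_temp if x in keep_words]

With the set, the full 250,000-token pass does one hash lookup per token rather than one list scan per token, so it should finish in seconds rather than hours.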