Tags: python-3.x, nlp, nltk, list-comprehension

Remove junk words from a large token list in NLTK


I am stuck processing a large text file.

Scenario: the text file has been tokenized into a list of length 250,000.

I want to remove the junk words from it, for which I am using NLTK and a list comprehension.

But for a list of just 100 tokens, the list comprehension already takes about 10 seconds:

    import time
    from nltk.corpus import stopwords, words

    # vocab_temp is the token list; this timing run uses a 100-token slice
    strt_time = time.time()
    no_junk = [x for x in vocab_temp if x in words.words()]
    print(time.time() - strt_time)
    # 9.56

So for the complete set it would take hours.

How can I optimize this?


Solution

  • This is because your list comprehension calls words.words() on every iteration. Since the word list does not change between comparisons, you can move that call outside the loop:

    import nltk
    from nltk.corpus import words

    nltk.download('words')

    vocab_temp = ['hello', 'world'] * 50  # 100-token sample list

    # Fetch the corpus word list once, outside the comprehension
    keep_words = words.words()

    no_junk = [x for x in vocab_temp if x in keep_words]
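
A further speed-up worth noting (not part of the original answer): words.words() returns a plain list, so each membership test in the comprehension is a linear scan over a couple hundred thousand entries. Converting the list to a set makes each test O(1) on average, which is what makes a 250,000-token pass practical. A minimal sketch, assuming vocab_temp holds your tokens (the sample contents here are placeholders):

    import nltk
    from nltk.corpus import words

    nltk.download('words')

    vocab_temp = ['hello', 'world', 'qwzx'] * 50  # placeholder token list

    # Build the lookup structure once, as a set: hashing gives
    # average O(1) membership tests instead of a linear scan per token.
    keep_words = set(words.words())

    no_junk = [x for x in vocab_temp if x in keep_words]

With the set, the full 250,000-token pass does one hash lookup per token rather than one list scan per token, so it should finish in seconds rather than hours.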