Tags: python, regex, stop-words

Faster way to remove stop words in Python


I am trying to remove stopwords from a string of text:

from nltk.corpus import stopwords
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split() if word not in (stopwords.words('english'))])

I am processing 6 million such strings, so speed is important. Profiling my code shows that the lines above are the slowest part. Is there a better way to do this? I'm thinking of using something like regex's re.sub, but I don't know how to write the pattern for a set of words. Can someone give me a hand? I'm also happy to hear about other, possibly faster, methods.

Note: I tried someone's suggestion of wrapping stopwords.words('english') with set(), but that made no difference.

Thank you.


Solution

  • Try caching the stopwords object, as shown below. Constructing this each time you call the function seems to be the bottleneck.

        from nltk.corpus import stopwords
    
        cachedStopWords = stopwords.words("english")
    
        def testFuncOld():
            text = 'hello bye the the hi'
            text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])
    
        def testFuncNew():
            text = 'hello bye the the hi'
            text = ' '.join([word for word in text.split() if word not in cachedStopWords])
    
        if __name__ == "__main__":
            for i in range(10000):
                testFuncOld()
                testFuncNew()
    

    I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are posted below.

        ncalls   cumtime   filename:lineno(function)
        10000    7.723     words.py:7(testFuncOld)
        10000    0.140     words.py:11(testFuncNew)

    So, caching the stopwords instance gives a roughly 55x speedup (7.723 s vs. 0.140 s).
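The question also asks how to write a re.sub pattern for a set of words. One way (a sketch, not part of the answer above) is to join the escaped stopwords into a single alternation wrapped in word boundaries and precompile it once. The helper name and the small hard-coded word list below are illustrative; in practice you would pass stopwords.words('english'). Whether this beats the cached-list approach depends on your data, so profile both.

```python
import re

def make_stopword_pattern(stop_words):
    # One alternation of escaped words, anchored on word boundaries;
    # the trailing \s* also consumes the following whitespace so no
    # double spaces are left behind after removal.
    return re.compile(r'\b(?:' + '|'.join(map(re.escape, stop_words)) + r')\b\s*')

# Small hard-coded list for illustration; use stopwords.words('english') in practice.
pattern = make_stopword_pattern(['the', 'a', 'an', 'in', 'of'])

def remove_stopwords(text):
    return pattern.sub('', text).strip()

print(remove_stopwords('hello bye the the hi'))  # -> 'hello bye hi'
```

On the set() note in the question: converting the cached list to a set (e.g. cachedStopWords = set(stopwords.words("english"))) makes each membership test O(1) instead of a linear scan. It made no difference in the original code only because the list was being rebuilt on every call, which dominated the cost.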