Tags: python, python-3.x, nlp, nltk, stemming

Is there a quicker Snowball stemmer for Python 3.6 than NLTK's?


I am currently using NLTK's SnowballStemmer to stem the words in my documents, and this worked fine when I had 68 documents. Now I have 4,000 documents and it is far too slow. I read another post where someone suggested using PyStemmer, but that is not offered for Python 3.6. Are there any other packages that would do the trick? Or maybe there's something I can change in the code to speed up the process.

Code:

import nltk
from sklearn.feature_extraction.text import CountVectorizer

eng_stemmer = nltk.stem.SnowballStemmer('english')
...
class StemmedCountVectorizer(CountVectorizer):
    # Run the default analyzer, then stem every token it produces.
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: [eng_stemmer.stem(w) for w in analyzer(doc)]
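
One code-level mitigation that does not require a new library is to memoize stem lookups, since the same tokens repeat many times across documents. A minimal sketch, assuming the setup above (the cached_stem helper and the CachedStemmedCountVectorizer name are illustrative, not from the original post):

from functools import lru_cache

# Memoize stems so each distinct token is stemmed only once.
@lru_cache(maxsize=None)
def cached_stem(word):
    return eng_stemmer.stem(word)

class CachedStemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: [cached_stem(w) for w in analyzer(doc)]

How much this helps depends on how many duplicate tokens the corpus contains; it does not change the cost of stemming each distinct word.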

Solution

  • PyStemmer's documentation does not say that it works with Python 3.6, but it actually does. Install the Visual Studio C++ Build Tools compatible with Python 3.6, which you can find here: http://landinghub.visualstudio.com/visual-cpp-build-tools

    And then try pip install pystemmer

    If that doesn't work, install manually exactly as described here: https://github.com/snowballstem/pystemmer
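
    Once installed, PyStemmer can be dropped into the same vectorizer pattern. A minimal sketch using PyStemmer's Stemmer.Stemmer('english') interface (the PyStemmedCountVectorizer name is illustrative, not from the original post):

    import Stemmer
    from sklearn.feature_extraction.text import CountVectorizer

    # PyStemmer's Snowball stemmers are implemented in C, and stemWords()
    # stems a whole token list in one call instead of one Python call per word.
    py_stemmer = Stemmer.Stemmer('english')

    class PyStemmedCountVectorizer(CountVectorizer):
        def build_analyzer(self):
            analyzer = super().build_analyzer()
            return lambda doc: py_stemmer.stemWords(analyzer(doc))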