
Deleting stopwords with Gensim


I'm trying to learn Gensim using its site. There is a function named 'remove_stopword_tokens' which is useful for my research. Although the function is defined and documented on their website (exact link: link), I can't import it in my Colab notebook.

Note: This is my code:

import gensim
from gensim.parsing.preprocessing import remove_stopword_tokens

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-2-dbd838c83237> in <module>
----> 1 from gensim.parsing.preprocessing import remove_stopword_tokens

ImportError: cannot import name 'remove_stopword_tokens' from 'gensim.parsing.preprocessing' (/usr/local/lib/python3.7/dist-packages/gensim/parsing/preprocessing.py)

---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.

Solution

  • updated & corrected answer

    You've run into a limitation of Google Colab - it may not have the most-recent version of libraries.

    You can see this by checking the value of gensim.__version__. In my check of Google Colab right now (September 2022), it reports 3.6.0, a version of Gensim that's about 4 years old and lacks later fixes & additions. The remove_stopword_tokens() function was only added recently, so it doesn't exist in that older release.
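
As a rough illustration of that check, here is a minimal sketch that compares a version string against a minimum. The helper names (version_tuple, has_token_helper) are hypothetical, and the 4.0.0 threshold is my assumption about when the function first shipped, so confirm it against the Gensim changelog before relying on it.

```python
# Assumption: remove_stopword_tokens() first appeared in the Gensim 4.x series.
ASSUMED_MINIMUM = (4, 0, 0)


def version_tuple(version_string):
    """Turn '3.6.0' into (3, 6, 0), stopping at the first non-numeric piece."""
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def has_token_helper(version_string, minimum=ASSUMED_MINIMUM):
    """Return True if the given version is at least the assumed minimum."""
    return version_tuple(version_string) >= minimum


# The version Colab reported in the check described above.
print(has_token_helper("3.6.0"))  # False
```

In a notebook you would pass the real value instead of a literal, that is, has_token_helper(gensim.__version__) after an import gensim.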

    Fortunately, you can update the gensim package backing the Colab notebook yourself, using a shell-escape to run pip. Inside a Colab cell, run:

    !pip install gensim -U
    

    If you'd already run import gensim in that session, pip will warn you that you must restart the runtime for the new code to be found.

    Note that, for clarity, you might prefer more-specific imports, as many project style guides suggest, rather than doing any broad top-level import gensim at all. Just import the individual classes and/or functions you need, specifically & explicitly. That is, just:

    from gensim.parsing.preprocessing import remove_stopword_tokens
    # ... other exact class/function/variable imports you'll use...
    
    remove_stopword_tokens(sentence)
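
If you can't upgrade right away, a guarded import is one possible stopgap. This is only a sketch: the fallback function is my own stand-in, not Gensim's code, and it assumes that the older release still exposes a STOPWORDS set in gensim.parsing.preprocessing. Note also that the function expects a list of tokens rather than a raw string.

```python
try:
    # Available once the Colab runtime has a recent Gensim.
    from gensim.parsing.preprocessing import remove_stopword_tokens
except ImportError:
    # Hypothetical stand-in for older Gensim releases (not the library's code).
    def remove_stopword_tokens(tokens, stopwords=None):
        """Return the tokens that are not in the stopword collection."""
        if stopwords is None:
            # Assumption: older releases still ship this default set.
            from gensim.parsing.preprocessing import STOPWORDS
            stopwords = STOPWORDS
        return [token for token in tokens if token not in stopwords]


# The input is a list of tokens, so split the text first.
tokens = "the quick fox".split()
print(remove_stopword_tokens(tokens, stopwords={"the"}))  # ['quick', 'fox']
```

Passing an explicit stopwords collection, as above, keeps the result independent of whichever default list the installed version ships.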
    

    On the other hand, if you want things simple-but-sloppy (not recommended), once you import gensim, it has already (via its own custom initialization routines) imported all of its submodules for you. So you could do:

    import gensim  # parsing & all gensim's other submodules now referenceable!
    
    gensim.parsing.preprocessing.remove_stopword_tokens(sentence)
    

    (Experienced Python programmers tend to avoid this latter approach, because it prefixes every call in the actual code with a long dot-path.)