Tags: python, text, spacy, wordnet

Meaningless Spacy Nouns


I am using spaCy to extract nouns from sentences. These sentences are grammatically poor and may also contain spelling mistakes.

Here is the code that I am using:

Code

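import re

import spacy

nlp = spacy.load("en_core_web_sm")

sentence = "HANDBRAKE - slow and fast (SFX)"
string = sentence.lower()
# Replace runs of non-word characters with a single space (raw string avoids escape warnings)
cleanString = re.sub(r"\W+", " ", string)
# \W does not match underscores, so replace them separately
cleanString = cleanString.replace("_", " ")

doc = nlp(cleanString)

# Print every token that spaCy tags as a noun
for token in doc:
    if token.pos_ == "NOUN":
        print(token.text)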
 

Output:

sfx

Similarly, for the sentence "fast foward2", the noun spaCy returns is:

foward2

This shows that the extracted nouns include meaningless tokens such as sfx, foward2, ms, 64x, bit, pwm, r, brailledisplayfastmovement, etc.

I only want to keep phrases that contain sensible single-word nouns like broom, ticker, pool, highway, etc.

I have tried WordNet, keeping only the spaCy nouns that also appear in WordNet, but it is too strict and removes some sensible nouns as well. For example, it filters out nouns like motorbike, whoosh, trolley, metal, suitcase, zip, etc.

Therefore, I am looking for a solution that keeps most of the sensible nouns from the list of spaCy nouns I have obtained.


Solution

  • It seems you can use the pyenchant library:

    Enchant is used to check the spelling of words and suggest corrections for words that are miss-spelled. It can use many popular spellchecking packages to perform this task, including ispell, aspell and MySpell. It is quite flexible at handling multiple dictionaries and multiple languages.

    More information is available on the Enchant website:

    https://abiword.github.io/enchant/

    Sample Python code:

    import re

    import enchant  # pip install pyenchant
    import spacy

    d = enchant.Dict("en_US")
    nlp = spacy.load("en_core_web_sm")

    sentence = "For example, it filters nouns like motorbike, whoosh, trolley, metal, suitcase, zip etc"
    # Lowercase, then collapse runs of non-word characters and underscores into one space
    cleanString = re.sub(r"[\W_]+", " ", sentence.lower())

    doc = nlp(cleanString)
    for token in doc:
        # Keep only nouns that the spelling dictionary recognises
        if token.pos_ == "NOUN" and d.check(token.text):
            print(token.text)
    # Output reported for this run (printed one per line):
    # example, nouns, motorbike, whoosh, trolley, metal, suitcase, zip
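
A spelling dictionary may still accept some of the short tokens from the question (single letters or abbreviations, for instance), so it can help to run a cheap structural check before the dictionary lookup. The following is only a sketch of that idea, not part of the original answer: the function name `keep_sensible_nouns`, the `min_length` threshold, and the stand-in vocabulary are all hypothetical and chosen for illustration. The dictionary lookup is passed in as a plain callable so the example runs without pyenchant installed.

```python
from typing import Callable, Iterable, List


def keep_sensible_nouns(
    nouns: Iterable[str],
    is_known_word: Callable[[str], bool],
    min_length: int = 3,  # hypothetical threshold; tune for your data
) -> List[str]:
    """Return the nouns that look like ordinary dictionary words."""
    kept = []
    for noun in nouns:
        # Cheap structural checks first: drop tokens containing digits or
        # punctuation ("foward2", "64x") and very short ones ("r", "ms").
        if not noun.isalpha() or len(noun) < min_length:
            continue
        # Then defer to the dictionary lookup supplied by the caller.
        if is_known_word(noun):
            kept.append(noun)
    return kept


# Stand-in vocabulary (an assumption) so the sketch runs on its own.
demo_vocabulary = {"broom", "ticker", "pool", "highway", "bit"}
candidates = ["sfx", "foward2", "ms", "64x", "bit", "r", "broom", "highway"]

result = keep_sensible_nouns(candidates, demo_vocabulary.__contains__)
print(result)  # ['bit', 'broom', 'highway']
```

With pyenchant, the bound method `enchant.Dict("en_US").check` could be passed as `is_known_word`, and the `nouns` argument could be the `token.text` values of tokens whose `pos_` is `"NOUN"`. Note that a real word such as "bit" passes both checks, so this approach cannot remove nouns that are valid English but irrelevant in context.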