python-3.x string replace nlp spacy

How to remove ORG names and GPE from noun chunk in spacy


I have the following code:

import spacy
from spacy.tokens import Span
import en_core_web_lg
nlpsm = en_core_web_lg.load()

doc = nlpsm(text)

finalwor = []
fil = [i for i in doc.ents if i.label_.lower() in ["person"]]
fil_a = [i for i in doc.ents if i.label_.lower() in ['GPE']]
fil_b = [i for i in doc.ents if i.label_.lower() in ['ORG']]
for chunk in doc.noun_chunks:
    if chunk not in fil and chunk not in fil_a and chunk not in fil_b:
        finalwor=list(doc.noun_chunks)
        print("finalwor after noun_chunk", finalwor)
    else:
        chunk in fil_a and chunk in fil_b
        entword=list(str(chunk.text).replace(str(chunk.text),""))
        finalwor.extend(entword)

I am not sure what I am doing wrong here. If the text is 'IT manager at Google',

my current output is 'IT manager, Google'.

The output I want is 'IT manager'.

Basically, I want company names and GPE names to be replaced by an empty string, or simply deleted.


Solution

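  • I think the problem is the line finalwor=list(doc.noun_chunks): it overwrites finalwor with every noun chunk in the doc instead of appending only the chunk that passed your filter. Also note that the labels are lowercased before the comparison, so they have to be compared against 'gpe' and 'org' rather than 'GPE' and 'ORG'; otherwise fil_a and fil_b are always empty.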

    You might be looking for something like this:

    import spacy
    import en_core_web_lg
    nlpsm = en_core_web_lg.load()
    
    doc = nlpsm('Maria, IT manager at Google and gardener')
    
    finalwor = []
    # Entity spans to exclude; labels are lowercased, so compare in lowercase
    fil = [i for i in doc.ents if i.label_.lower() in ["person"]]
    fil_a = [i for i in doc.ents if i.label_.lower() in ['gpe']]
    fil_b = [i for i in doc.ents if i.label_.lower() in ['org']]
    
    for chunk in doc.noun_chunks:
        # Keep only the chunks that are not one of the excluded entities
        if chunk not in fil and chunk not in fil_a and chunk not in fil_b:
            finalwor.append(chunk)
    
    # Print once, after the loop has collected every kept chunk
    print("finalwor after noun_chunk", finalwor)
    

    finalwor after noun_chunk [IT manager, gardener]
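
    If comparing the chunk and entity objects directly does not filter as expected in your setup, one alternative is to compare character offsets instead. The following is only a rough sketch under a few assumptions: drop_entity_chunks and FakeSpan are hypothetical names I made up for illustration, and the function relies only on the start_char, end_char and label_ attributes, which spaCy's Span objects expose. It drops any chunk that overlaps an entity with one of the given labels, which is a slightly broader rule than exact equality. The demo uses hand-built stand-in spans so it runs without a model.

```python
from collections import namedtuple

# Hypothetical stand-in with the attributes used below; spaCy Span objects
# expose the same ones (text, start_char, end_char, label_).
FakeSpan = namedtuple("FakeSpan", "text start_char end_char label_")


def drop_entity_chunks(chunks, ents, labels=("PERSON", "GPE", "ORG")):
    """Return the chunks that do not overlap any entity whose label is in `labels`."""
    # Character ranges covered by the entities we want to exclude
    blocked = [(e.start_char, e.end_char) for e in ents if e.label_ in labels]
    kept = []
    for chunk in chunks:
        # Two half-open ranges overlap when each one starts before the other ends
        overlaps = any(
            chunk.start_char < end and start < chunk.end_char
            for start, end in blocked
        )
        if not overlaps:
            kept.append(chunk)
    return kept


# Hand-built spans for the text 'IT manager at Google' (offsets counted by hand)
chunks = [FakeSpan("IT manager", 0, 10, "NP"), FakeSpan("Google", 14, 20, "NP")]
ents = [FakeSpan("Google", 14, 20, "ORG")]

result = [c.text for c in drop_entity_chunks(chunks, ents)]
print(result)
```

    With a real pipeline you would presumably call it as drop_entity_chunks(doc.noun_chunks, doc.ents). Keep in mind that the overlap rule discards a whole chunk even if only part of it is an entity, so use an exact offset match instead if that is too aggressive for your data.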