python nlp spacy named-entity-recognition

Replace entities with their entity labels using spaCy


I want to replace each entity in my data with its label using spaCy. I have 3000 rows of text in which the entities need to be replaced with their entity labels.

For example:

"Georgia recently became the first U.S. state to "ban Muslim culture."

I want it to become:

"GPE recently became the ORDINAL GPE state to "ban NORP culture."

I want code that does this for many rows of text, not just one.

Thanks very much.

For example, the snippets below work for a single sentence. I want to adapt them so that s (a single string) becomes a column containing 3000 rows.

First one, from "Replace entity with its label in SpaCy":

import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; the original snippet does not say which one
s = "His friend Nicolas J. Smith is here with Bart Simpon and Fred."
doc = nlp(s)
newString = s
for e in reversed(doc.ents):  # reversed so earlier offsets stay valid while substituting
    newString = newString[:e.start_char] + e.label_ + newString[e.end_char:]
print(newString)
#His friend PERSON is here with PERSON and PERSON.

Second one, from "Merging tags into my file using named entity annotation":

import spacy

nlp = spacy.load("en_core_web_sm")
s ="Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(s)

def replaceSubstring(s, replacement, position, length_of_replaced):
    # splice `replacement` over the slice that starts at `position`
    return s[:position] + replacement + s[position + length_of_replaced:]

for ent in reversed(doc.ents):  # reversed so earlier offsets stay valid
    #print(ent.text, ent.start_char, ent.end_char, ent.label_)
    replacement = "<{}>{}</{}>".format(ent.label_, ent.text, ent.label_)
    position = ent.start_char
    length_of_replaced = ent.end_char - ent.start_char
    s = replaceSubstring(s, replacement, position, length_of_replaced)

print(s)
#<ORG>Apple</ORG> is looking at buying <GPE>U.K.</GPE> startup for <MONEY>$1 billion</MONEY>

Solution

  • IIUC, you may achieve what you want with:

    1. Reading your texts from a file, each text on its own line
    2. Processing the texts by substituting entities, if any, with their labels
    3. Writing the results to disk, each text on its own line

    Demo:

    import spacy
    nlp = spacy.load("en_core_web_md")
    
    #read txt file, each string on its own line
    with open("./try.txt","r") as f:
        texts = f.read().splitlines()
    
    #substitute entities with their TAGS
    docs = nlp.pipe(texts)
    out = []
    for doc in docs:
        out_ = ""
        for tok in doc:
            text = tok.text
            if tok.ent_type_:
                text = tok.ent_type_
            out_ += text + tok.whitespace_
        out.append(out_)
    
    # write to file
    with open("./out_try.txt","w") as f:
        f.write("\n".join(out))
    

    Contents of input file:

    Georgia recently became the first U.S. state to "ban Muslim culture.
    His friend Nicolas J. Smith is here with Bart Simpon and Fred.
    Apple is looking at buying U.K. startup for $1 billion

    Contents of output file:

    GPE recently became the ORDINAL GPE state to "ban NORP culture.
    His friend PERSON PERSON PERSON is here with PERSON PERSON and PERSON.
    ORG is looking at buying GPE startup for MONEYMONEY MONEY
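
If the 3000 rows live in a column of a CSV file rather than one text per line, the same read, process, write shape still applies. A minimal sketch using only the standard library follows. The names relabel_column and fake_relabel and the column name "text" are hypothetical, and fake_relabel merely stands in for the nlp.pipe loop above so that the sketch runs without a model.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List


def relabel_column(
    rows: List[Dict[str, str]],
    column: str,
    relabel: Callable[[Iterable[str]], Iterable[str]],
) -> List[Dict[str, str]]:
    """Return copies of rows with `column` replaced by the output of `relabel`."""
    texts = [row[column] for row in rows]
    # relabel receives all texts at once, so it is free to batch them
    new_texts = list(relabel(texts))
    if len(new_texts) != len(rows):
        raise ValueError("relabel must return one text per input row")
    return [{**row, column: new} for row, new in zip(rows, new_texts)]


def fake_relabel(texts: Iterable[str]) -> List[str]:
    # Stand-in for the spaCy loop (assumption: not a real NER step)
    return [t.replace("Georgia", "GPE") for t in texts]


# In-memory CSV so the example does not depend on a file on disk
sample = io.StringIO("id,text\n1,Georgia is a state.\n2,No entities here.\n")
rows = list(csv.DictReader(sample))
labelled = relabel_column(rows, "text", fake_relabel)
print(labelled[0]["text"])
```

In practice relabel would wrap the nlp.pipe loop, and the rows could be written back with csv.DictWriter.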

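    Note the MONEYMONEY pattern in the last line of the output, and the repeated PERSON labels in the second.

    Both arise because the loop replaces every token of an entity separately, and "$" has no trailing whitespace before "1":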

    doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
    for tok in doc:
        print(f"{tok.text}, {tok.ent_type_}, whitespace='{tok.whitespace_}'")
    

    Apple, ORG, whitespace=' '
    is, , whitespace=' '
    looking, , whitespace=' '
    at, , whitespace=' '
    buying, , whitespace=' '
    U.K., GPE, whitespace=' '
    startup, , whitespace=' '
    for, , whitespace=' '
    $, MONEY, whitespace='' # <-- no whitespace between $ and 1
    1, MONEY, whitespace=' '
    billion, MONEY, whitespace=''
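
One possible way to avoid both the repeated labels and the MONEYMONEY run is to emit the label once per entity and keep only the whitespace of the entity's last token. The sketch below works on plain tuples so that it runs without a model. TokenInfo and collapse_entities are hypothetical names; with spaCy, each tuple would be built from tok.text, tok.ent_type_, tok.ent_iob_ and tok.whitespace_.

```python
from typing import Iterable, List, Tuple

# (text, ent_type, ent_iob, whitespace), mirroring spaCy's
# tok.text, tok.ent_type_, tok.ent_iob_, tok.whitespace_
TokenInfo = Tuple[str, str, str, str]


def collapse_entities(tokens: Iterable[TokenInfo]) -> str:
    """Rebuild the text, writing each entity's label once per entity."""
    pieces: List[str] = []
    inside = False  # True while the previous token belonged to an entity
    for text, ent_type, ent_iob, whitespace in tokens:
        if not ent_type:
            # ordinary token: copy it through unchanged
            pieces.append(text + whitespace)
            inside = False
        elif ent_iob == "I" and inside:
            # continuation of the current entity: keep one label and
            # take over the trailing whitespace of the latest token
            pieces[-1] = ent_type + whitespace
        else:
            # first token of a new entity
            pieces.append(ent_type + whitespace)
            inside = True
    return "".join(pieces)


# Tokens copied from the printout above, with the B/I/O tag added by hand
demo_tokens = [
    ("Apple", "ORG", "B", " "),
    ("is", "", "O", " "),
    ("looking", "", "O", " "),
    ("at", "", "O", " "),
    ("buying", "", "O", " "),
    ("U.K.", "GPE", "B", " "),
    ("startup", "", "O", " "),
    ("for", "", "O", " "),
    ("$", "MONEY", "B", ""),
    ("1", "MONEY", "I", " "),
    ("billion", "MONEY", "I", ""),
]
print(collapse_entities(demo_tokens))
# ORG is looking at buying GPE startup for MONEY
```

Inside the demo's loop over docs, the inner token loop could be replaced by a call such as collapse_entities((t.text, t.ent_type_, t.ent_iob_, t.whitespace_) for t in doc).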