python nlp spacy named-entity-recognition

Replacement entity with their entity label using spacy

I want to do for my data by replacing each entity with its label using Spacy and I have 3000 text rows needed to replace entities with their label entity,

for example:

"Georgia recently became the first U.S. state to "ban Muslim culture."

And want to become like this:

"GPE recently became the ORDINAL GPE state to "ban NORP culture. "

I want code to replace more than rows of text.

Thanks very much.

For example these codes but for one sentence, I want to modify s (string) to column contains 3000 rows

First one: from (Replace entity with its label in SpaCy)

s= "His friend Nicolas J. Smith is here with Bart Simpon and Fred."
doc = nlp(s)
newString = s
for e in reversed(doc.ents): #reversed to not modify the offsets of other entities when substituting
    start = e.start_char
    end = start + len(e.text)
    newString = newString[:start] + e.label_ + newString[end:]
print(newString)
#His friend PERSON is here with PERSON and PERSON.

Second one: from (Merging tags into my file using named entity annotation)

import spacy

nlp = spacy.load("en_core_web_sm")
s ="Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(s)

def replaceSubstring(s, replacement, position, length_of_replaced):
    s = s[:position] + replacement + s[position+length_of_replaced:]
    return(s)

for ent in reversed(doc.ents):
    #print(ent.text, ent.start_char, ent.end_char, ent.label_)
    replacement = "<{}>{}</{}>".format(ent.label_,ent.text, ent.label_)
    position = ent.start_char
    length_of_replaced = ent.end_char - ent.start_char 
    s = replaceSubstring(s, replacement, position, length_of_replaced)

print(s)
#<ORG>Apple</ORG> is looking at buying <GPE>U.K.</GPE> startup for <MONEY>$1 billion</MONEY>

Solution

IIUC, you may achieve what you want with:

Reading your texts from file, each text on its own line
Processing results by substituting entities, if any, with their tags
Writing results to disc, each text on its own line

Demo:

import spacy
nlp = spacy.load("en_core_web_md")

#read txt file, each string on its own line
with open("./try.txt","r") as f:
    texts = f.read().splitlines()

#substitute entities with their TAGS
docs = nlp.pipe(texts)
out = []
for doc in docs:
    out_ = ""
    for tok in doc:
        text = tok.text
        if tok.ent_type_:
            text = tok.ent_type_
        out_ += text + tok.whitespace_
    out.append(out_)

# write to file
with open("./out_try.txt","w") as f:
    f.write("\n".join(out))

Contents of input file:

Georgia recently became the first U.S. state to "ban Muslim culture.
His friend Nicolas J. Smith is here with Bart Simpon and Fred.
Apple is looking at buying U.K. startup for $1 billion

Contents of output file:

GPE recently became the ORDINAL GPE state to "ban NORP culture.
His friend PERSON PERSON PERSON is here with PERSON PERSON and PERSON.
ORG is looking at buying GPE startup for MONEYMONEY MONEY

Note the MONEYMONEY pattern.

This is because:

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for tok in doc:
    print(f"{tok.text}, {tok.ent_type_}, whitespace='{tok.whitespace_}'")

Apple, ORG, whitespace=' '
is, , whitespace=' '
looking, , whitespace=' '
at, , whitespace=' '
buying, , whitespace=' '
U.K., GPE, whitespace=' '
startup, , whitespace=' '
for, , whitespace=' '
$, MONEY, whitespace='' # <-- no whitespace between $ and 1
1, MONEY, whitespace=' '
billion, MONEY, whitespace=''