I want to replace each entity in my data with its entity label using spaCy. I have 3000 rows of text in which the entities need to be replaced with their labels,
for example:
"Georgia recently became the first U.S. state to "ban Muslim culture."
I want it to become:
"GPE recently became the ORDINAL GPE state to "ban NORP culture. "
I want code that does this replacement for many rows of text, not just one.
Thanks very much.
For example, the two snippets below do this, but only for one sentence. I want to modify them so that, instead of a single string s, they process a column containing 3000 rows.
First one: from (Replace entity with its label in SpaCy)
import spacy
nlp = spacy.load("en_core_web_sm")

s = "His friend Nicolas J. Smith is here with Bart Simpon and Fred."
doc = nlp(s)
newString = s
# reversed, so that substituting does not shift the offsets of the other entities
for e in reversed(doc.ents):
    start = e.start_char
    end = start + len(e.text)
    newString = newString[:start] + e.label_ + newString[end:]
print(newString)
# His friend PERSON is here with PERSON and PERSON.
Second one: from (Merging tags into my file using named entity annotation)
import spacy
nlp = spacy.load("en_core_web_sm")
s = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(s)

def replaceSubstring(s, replacement, position, length_of_replaced):
    s = s[:position] + replacement + s[position+length_of_replaced:]
    return s

for ent in reversed(doc.ents):
    # print(ent.text, ent.start_char, ent.end_char, ent.label_)
    replacement = "<{}>{}</{}>".format(ent.label_, ent.text, ent.label_)
    position = ent.start_char
    length_of_replaced = ent.end_char - ent.start_char
    s = replaceSubstring(s, replacement, position, length_of_replaced)
print(s)
# <ORG>Apple</ORG> is looking at buying <GPE>U.K.</GPE> startup for <MONEY>$1 billion</MONEY>
IIUC, you may achieve what you want with:
Demo:
import spacy
nlp = spacy.load("en_core_web_md")

# read txt file, each string on its own line
with open("./try.txt", "r") as f:
    texts = f.read().splitlines()

# substitute entities with their TAGS
docs = nlp.pipe(texts)
out = []
for doc in docs:
    out_ = ""
    for tok in doc:
        text = tok.text
        if tok.ent_type_:
            text = tok.ent_type_
        out_ += text + tok.whitespace_
    out.append(out_)

# write to file
with open("./out_try.txt", "w") as f:
    f.write("\n".join(out))
Contents of input file:
Georgia recently became the first U.S. state to "ban Muslim culture.
His friend Nicolas J. Smith is here with Bart Simpon and Fred.
Apple is looking at buying U.K. startup for $1 billion
Contents of output file:
GPE recently became the ORDINAL GPE state to "ban NORP culture.
His friend PERSON PERSON PERSON is here with PERSON PERSON and PERSON.
ORG is looking at buying GPE startup for MONEYMONEY MONEY
Note the MONEYMONEY pattern.
This is because:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for tok in doc:
    print(f"{tok.text}, {tok.ent_type_}, whitespace='{tok.whitespace_}'")
Apple, ORG, whitespace=' '
is, , whitespace=' '
looking, , whitespace=' '
at, , whitespace=' '
buying, , whitespace=' '
U.K., GPE, whitespace=' '
startup, , whitespace=' '
for, , whitespace=' '
$, MONEY, whitespace='' # <-- no whitespace between $ and 1
1, MONEY, whitespace=' '
billion, MONEY, whitespace=''
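If you would rather get one label per entity (so "$1 billion" becomes a single MONEY and "Nicolas J. Smith" a single PERSON), one option is to substitute whole entity spans by character offset instead of going token by token. The following is only a minimal sketch of that idea: the helper names `replace_spans` and `label_entities` are made up for illustration, and the model name is an assumption you can change.

```python
def replace_spans(text, spans):
    """Replace each (start, end, label) character span in `text` with its label.

    Spans are processed from right to left so that earlier offsets stay valid.
    Assumes the spans do not overlap (true for spaCy's doc.ents).
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + label + text[end:]
    return text


def label_entities(texts, model="en_core_web_sm"):
    """Return a list with the entities of every text replaced by their labels.

    Hypothetical helper: spaCy is imported here so that replace_spans
    can be used (and tested) without spaCy installed.
    """
    import spacy

    nlp = spacy.load(model)
    return [
        replace_spans(
            doc.text,
            [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents],
        )
        for doc in nlp.pipe(texts)
    ]
```

With the `texts` list read from the file above, `out = label_entities(texts)` would be the equivalent call. For a pandas column, something like `df["labeled"] = label_entities(df["text"].tolist())` should work, assuming a DataFrame `df` with a `text` column (both names are placeholders).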