Code:
import spacy

nlp = spacy.load("en_core_web_md")

# Read the text file; each string is on its own line
with open("./try.txt", "r") as f:
    texts = f.read().splitlines()

# Substitute entities with their tags
docs = nlp.pipe(texts)
out = []
for doc in docs:
    out_ = ""
    for tok in doc:
        text = tok.text
        if tok.ent_type_:
            text = tok.ent_type_
        out_ += text + tok.whitespace_
    out.append(out_)

# Write to file
with open("./out_try.txt", "w") as f:
    f.write("\n".join(out))
Contents of input file:
Georgia recently became the first U.S. state to "ban Muslim culture.
His friend Nicolas J. Smith is here with Bart Simpon and Fred.
Apple is looking at buying U.K. startup for $1 billion
Contents of output file:
GPE recently became the ORDINAL GPE state to "ban NORP culture.
His friend PERSON PERSON PERSON is here with PERSON PERSON and PERSON.
ORG is looking at buying GPE startup for MONEYMONEY MONEY
I need to avoid this problem in the sentences above. For example, in sentence 2, 'PERSON PERSON PERSON' should become a single entity, PERSON.
Let's try:
import spacy
# spaCy v2 import; in spaCy v3 use: from spacy.training import offsets_to_biluo_tags
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.load("en_core_web_md")

# Read the text file; each string is on its own line
with open("./try.txt", "r") as f:
    texts = f.read().splitlines()

docs = nlp.pipe(texts)
out_text = ""
for doc in docs:
    # Character offsets and label of every entity span in the doc
    offsets = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    # One BILUO tag per token, e.g. "B-PERSON", "I-PERSON", "L-PERSON", "U-GPE", "O"
    tags = biluo_tags_from_offsets(doc, offsets)
    out = []
    for tok, tag in zip(doc, tags):
        prefix = tag.split("-")[0]
        if prefix == "O":
            # Not part of an entity: keep the token as is
            out.append(tok.text + tok.whitespace_)
        elif prefix in ("U", "L"):
            # Single-token entity or last token of a multi-token entity:
            # emit the label once; B and I tokens are dropped
            out.append(tok.ent_type_ + tok.whitespace_)
    out_text += "".join(out) + "\n"

with open("out_try.txt", "w") as f:
    f.write(out_text)
Contents of the output file:
GPE recently became the ORDINAL GPE state to "ban NORP culture.
His friend PERSON is here with PERSON and PERSON.
ORG is looking at buying GPE startup for MONEY
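The same result can probably be reached without BILUO tags by replacing each entity span by its character offsets. The sketch below is only an illustration: `replace_spans` is a hypothetical helper (not part of spaCy), the offsets are written by hand so the snippet runs without a model, and it assumes the spans do not overlap, which should hold for `doc.ents`.

```python
from typing import Iterable, Tuple

# (start_char, end_char, label); hypothetical alias used only in this sketch
Span = Tuple[int, int, str]


def replace_spans(text: str, spans: Iterable[Span]) -> str:
    """Replace each [start, end) character span in `text` with its label.

    Assumes the spans do not overlap.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(label)               # one label for the whole span
        cursor = end
    pieces.append(text[cursor:])           # text after the last entity
    return "".join(pieces)


# Hand-written offsets standing in for spaCy output. With a loaded pipeline
# they would come from:
#   [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
sentence = "Apple is looking at buying U.K. startup for $1 billion"
spans = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]
result = replace_spans(sentence, spans)
print(result)
```

Because the replacement works on the original string rather than on tokens, the whitespace between and after entities is preserved automatically, so there is no need to handle `tok.whitespace_`.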