Can anyone assist, please?
I'm attempting to tokenize a document using spaCy so that named entities become single tokens. For example:
'New York is a city in the United States of America'
would be tokenized as:
['New York', 'is', 'a', 'city', 'in', 'the', 'United States of America']
Any tips on how to do this are very welcome. I've looked at using span.merge(), but without success; I'm new to coding, so I've likely missed something.
Thank you in advance
Use the doc.retokenize context manager to merge entity spans into single tokens. Wrap this in a custom pipeline component, and add the component to your language model.
import spacy

class EntityRetokenizeComponent:
    def __init__(self, nlp):
        pass

    def __call__(self, doc):
        # Merge each entity span into a single token, setting the
        # merged token's lemma to the full entity text.
        with doc.retokenize() as retokenizer:
            for ent in doc.ents:
                retokenizer.merge(ent, attrs={"LEMMA": ent.text})
        return doc
nlp = spacy.load('en_core_web_sm')  # the old 'en' shortcut link no longer works in recent spaCy
retokenizer = EntityRetokenizeComponent(nlp)
nlp.add_pipe(retokenizer, name='merge_phrases', last=True)
doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
"converse in the Oval Office inside the White House in Washington, D.C.")
[tok for tok in doc]
#[German,
# Chancellor,
# Angela Merkel,
# and,
# US,
# President,
# Barack Obama,
# converse,
# in,
# the Oval Office,
# inside,
# the White House,
# in,
# Washington,
# ,,
# D.C.]
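Note that on spaCy v3+, nlp.add_pipe expects the string name of a registered component rather than an instance, so the class above will raise an error there. Here is a minimal sketch of the same idea under v3 (the component name "merge_entity_spans" is arbitrary):

from spacy.language import Language
import spacy

@Language.component("merge_entity_spans")
def merge_entity_spans(doc):
    # Same retokenize logic as above, registered as a v3 component.
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent, attrs={"LEMMA": ent.text})
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_entity_spans", last=True)

spaCy also ships a built-in merge_entities component that does exactly this, so on v3 you can simply call nlp.add_pipe("merge_entities").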