
Tokenizing Named Entities in Spacy


Can anyone assist, please?

I'm attempting to tokenize a document using spaCy so that named entities are kept as single tokens. For example:

'New York is a city in the United States of America'

would be tokenized as:

['New York', 'is', 'a', 'city', 'in', 'the', 'United States of America']

Any tips on how to do this are very welcome. I have looked at using span.merge(), but without success; I am new to coding, so I have likely missed something.

Thank you in advance


Solution

  • Use the doc.retokenize context manager to merge each entity span into a single token. Wrap this in a custom pipeline component, and add the component to your language model:

    import spacy
    
    class EntityRetokenizeComponent:
        """Pipeline component that merges each named entity into a single token."""
        def __init__(self, nlp):
            pass

        def __call__(self, doc):
            with doc.retokenize() as retokenizer:
                for ent in doc.ents:
                    # Merge the entity span and set its lemma to the full entity text
                    retokenizer.merge(ent, attrs={"LEMMA": ent.text})
            return doc
    
    nlp = spacy.load('en_core_web_sm')  # requires: python -m spacy download en_core_web_sm
    retokenizer = EntityRetokenizeComponent(nlp)
    nlp.add_pipe(retokenizer, name='merge_phrases', last=True)  # spaCy v2 API: pass the component instance
    
    doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
              "converse in the Oval Office inside the White House in Washington, D.C.")
    
    [tok for tok in doc]
    
    #[German,
    # Chancellor,
    # Angela Merkel,
    # and,
    # US,
    # President,
    # Barack Obama,
    # converse,
    # in,
    # the Oval Office,
    # inside,
    # the White House,
    # in,
    # Washington,
    # ,,
    # D.C.]
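
  • On newer spaCy versions (v3+), nlp.add_pipe expects the string name of a registered component rather than an instance, and spaCy ships a built-in merge_entities component that does the same merging, so the custom class is not strictly needed there. A minimal sketch, assuming the en_core_web_sm model is installed:

    import spacy
    
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("merge_entities", last=True)  # built-in component that merges entity spans into single tokens
    
    doc = nlp("New York is a city in the United States of America")
    [tok for tok in doc]
    
    # Entity spans such as "New York" now come back as single tokens;
    # the exact boundaries depend on the model's NER predictions.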