
Tokenizing Named Entities in Spacy


Can anyone assist, please?

I'm attempting to tokenize a document using spaCy so that named entities are kept as single tokens. For example:

'New York is a city in the United States of America'

would be tokenized as:

['New York', 'is', 'a', 'city', 'in', 'the', 'United States of America']

Any tips on how to do this are very welcome. I have looked at using span.merge(), but without success; I am new to coding, so I have likely missed something.

Thank you in advance


Solution

  • Use the doc.retokenize context manager to merge each entity span into a single token. Wrap this in a custom pipeline component, and add the component to your language model:

    import spacy
    
    class EntityRetokenizeComponent:
        """Pipeline component that merges each named entity into a single token."""
        def __init__(self, nlp):
            pass

        def __call__(self, doc):
            with doc.retokenize() as retokenizer:
                for ent in doc.ents:
                    # Merge the entity span and set its lemma to the full entity text
                    retokenizer.merge(ent, attrs={"LEMMA": ent.text})
            return doc
    
    nlp = spacy.load('en_core_web_sm')  # requires: python -m spacy download en_core_web_sm
    retokenizer = EntityRetokenizeComponent(nlp)
    nlp.add_pipe(retokenizer, name='merge_phrases', last=True)  # spaCy v2 API: pass the component instance
    
    doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
              "converse in the Oval Office inside the White House in Washington, D.C.")
    
    [tok for tok in doc]
    
    #[German,
    # Chancellor,
    # Angela Merkel,
    # and,
    # US,
    # President,
    # Barack Obama,
    # converse,
    # in,
    # the Oval Office,
    # inside,
    # the White House,
    # in,
    # Washington,
    # ,,
    # D.C.]
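
  • On newer spaCy versions (v3+), nlp.add_pipe expects the string name of a registered component rather than an instance, and spaCy ships a built-in merge_entities component that does the same merging, so the custom class is not strictly needed there. A minimal sketch, assuming the en_core_web_sm model is installed:

    import spacy
    
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("merge_entities", last=True)  # built-in component that merges entity spans into single tokens
    
    doc = nlp("New York is a city in the United States of America")
    [tok for tok in doc]
    
    # Entity spans such as "New York" now come back as single tokens;
    # the exact boundaries depend on the model's NER predictions.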