Tags: python, nlp, spacy

AttributeError: 'spacy.tokens.span.Span' object has no attribute 'merge'


I'm working on an NLP project and trying to follow this tutorial: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e. While executing this part:

import spacy

# Load the large English NLP model
nlp = spacy.load('en_core_web_lg')

# Replace a token with "REDACTED" if it is a name
def replace_name_with_placeholder(token):
    if token.ent_iob != 0 and token.ent_type_ == "PERSON":
        return "[REDACTED] "
    else:
        return token.string

# Loop through all the entities in a document and check if they are names
def scrub(text):
    doc = nlp(text)
    for ent in doc.ents:
        ent.merge()
    tokens = map(replace_name_with_placeholder, doc)
    return "".join(tokens)

s = """
In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence". 
In 1957, Noam Chomsky’s 
 Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of 
 syntactic structures.
 """

 print(scrub(s))

this error appears:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-62-ab1c786c4914> in <module>
      4 """
      5 
----> 6 print(scrub(s))

<ipython-input-60-4742408aa60f> in scrub(text)
      3     doc = nlp(text)
      4     for ent in doc.ents:
----> 5         ent.merge()
      6     tokens = map(replace_name_with_placeholder, doc)
      7     return "".join(tokens)

AttributeError: 'spacy.tokens.span.Span' object has no attribute 'merge'

Solution

  • spaCy did away with the span.merge() method after that tutorial was written. The way to do this now is with doc.retokenize(): https://spacy.io/api/doc#retokenize. I implemented it for your scrub function below:

    # Loop through all the entities in a document and check if they are names
    def scrub(text):
        doc = nlp(text)
        with doc.retokenize() as retokenizer:
            for ent in doc.ents:
                retokenizer.merge(ent)
        tokens = map(replace_name_with_placeholder, doc)
        return "".join(tokens)
    
    s = """
    In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence". 
    In 1957, Noam Chomsky’s 
     Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of 
     syntactic structures.
     """
    
    print(scrub(s))
    

    Other notes:

    1. Your replace_name_with_placeholder function will also throw an error: Token.string no longer exists, so use token.text instead. I fixed it below:

       def replace_name_with_placeholder(token):
           if token.ent_iob != 0 and token.ent_type_ == "PERSON":
               return "[REDACTED] "
           else:
               return token.text
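
       Note that the old token.string included the token's trailing whitespace, so if the "".join output runs words together, token.text_with_ws is the closer drop-in replacement.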
      
    2. If you are extracting entities and, in addition, other spans such as doc.noun_chunks, you may run into issues like this one:

       ValueError: [E102] Can't merge non-disjoint spans. 'Computing' is already part of 
       tokens to merge. If you want to find the longest non-overlapping spans, you can 
       use the util.filter_spans helper:
       https://spacy.io/api/top-level#util.filter_spans
      

      For this reason, you may also want to look into spacy.util.filter_spans: https://spacy.io/api/top-level#util.filter_spans.
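
      Below is a minimal sketch of that combination; the sample sentence and variable names are my own, not from the question or the tutorial:

         import spacy
         from spacy.util import filter_spans

         nlp = spacy.load('en_core_web_lg')
         doc = nlp('In 1950, Alan Turing published his famous article.')

         # Entities and noun chunks can overlap (e.g. the entity "Alan Turing"
         # and a noun chunk covering the same words), which is what triggers
         # E102 if you merge both lists naively.
         spans = list(doc.ents) + list(doc.noun_chunks)

         # filter_spans keeps the longest non-overlapping spans, so the
         # retokenizer only ever sees disjoint spans.
         with doc.retokenize() as retokenizer:
             for span in filter_spans(spans):
                 retokenizer.merge(span)

         print([token.text for token in doc])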