I'm working on an NLP project, following this tutorial: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e. While executing this part:
import spacy

# Load the large English NLP model
nlp = spacy.load('en_core_web_lg')

# Replace a token with "REDACTED" if it is a name
def replace_name_with_placeholder(token):
    if token.ent_iob != 0 and token.ent_type_ == "PERSON":
        return "[REDACTED] "
    else:
        return token.string

# Loop through all the entities in a document and check if they are names
def scrub(text):
    doc = nlp(text)
    for ent in doc.ents:
        ent.merge()
    tokens = map(replace_name_with_placeholder, doc)
    return "".join(tokens)
s = """
In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence".
In 1957, Noam Chomsky’s
Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of
syntactic structures.
"""
print(scrub(s))
this error appears:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-62-ab1c786c4914> in <module>
4 """
5
----> 6 print(scrub(s))
<ipython-input-60-4742408aa60f> in scrub(text)
3 doc = nlp(text)
4 for ent in doc.ents:
----> 5 ent.merge()
6 tokens = map(replace_name_with_placeholder, doc)
7 return "".join(tokens)
AttributeError: 'spacy.tokens.span.Span' object has no attribute 'merge'
spaCy did away with the span.merge() method since that tutorial was made. The way to do this now is with doc.retokenize(): https://spacy.io/api/doc#retokenize. I implemented it for your scrub function below:
# Loop through all the entities in a document and check if they are names
def scrub(text):
    doc = nlp(text)
    # Merge each entity span into a single token so it can be
    # replaced in one piece
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    tokens = map(replace_name_with_placeholder, doc)
    return "".join(tokens)
s = """
In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence".
In 1957, Noam Chomsky’s
Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of
syntactic structures.
"""
print(scrub(s))
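If you want to sanity-check the retokenizing step, you can inspect the entities after the merge; each one should now be a single token (a quick check, assuming the model tags both names as PERSON):

doc = nlp(s)
with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(ent)
# len(ent) should be 1 for every entity after the merge
print([(ent.text, ent.label_, len(ent)) for ent in doc.ents])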
Other notes:
Your replace_name_with_placeholder function will also throw an error: token.string was removed as well. Use token.text_with_ws instead (not token.text, since the trailing whitespace is what lets the "".join(tokens) in scrub reconstruct the original spacing). I fixed it below:

def replace_name_with_placeholder(token):
    # ent_iob != 0 means the token carries entity annotation at all
    if token.ent_iob != 0 and token.ent_type_ == "PERSON":
        return "[REDACTED] "
    else:
        return token.text_with_ws
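A quick illustration of the difference between the two attributes:

doc = nlp("Alan Turing was a pioneer.")
print("".join(t.text for t in doc))          # AlanTuringwasapioneer. (spacing lost)
print("".join(t.text_with_ws for t in doc))  # Alan Turing was a pioneer. (spacing kept)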
If you are extracting entities and, in addition, other spans like doc.noun_chunks, you may run into issues such as this one:

ValueError: [E102] Can't merge non-disjoint spans. 'Computing' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper: https://spacy.io/api/top-level#util.filter_spans

As the message suggests, spacy.util.filter_spans keeps only the longest of any set of overlapping spans, so the retokenizer never sees a conflict.
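A minimal sketch of how that might look, assuming you want to merge the noun chunks as well as the entities:

from spacy.util import filter_spans

def scrub(text):
    doc = nlp(text)
    # Keep only the longest non-overlapping spans out of the
    # combined entity and noun-chunk lists
    spans = filter_spans(list(doc.ents) + list(doc.noun_chunks))
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    tokens = map(replace_name_with_placeholder, doc)
    return "".join(tokens)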