
spaCy: Detect if entity is quoted


I need to detect whether a given entity is surrounded by quotes, either single or double. How would I go about this?

My first thought was to add a custom extension to the span:

from spacy.tokens import Span

def is_quoted(span):
   prev_token = span.doc[span.start - 1]
   next_token = span.doc[span.end + 1]

   return prev_token in ["\"", "'"] and next_token in ["\"", "'"]

Span.set_extension("is_quoted", getter=is_quoted)

But would this really be the most efficient way of doing this? I only want to do this on entities.

Or am I better off just writing a custom matcher with a specific regex? But that would then run on my entire document.


Solution

  • Your custom extension looks fine if you only care about quotes immediately before and after the span. You just need to handle spans at the start or end of the doc correctly, which you aren't doing now: for a span at the start, span.start - 1 is -1, so you'll check doc[-1], the last token.

    Do you care about things like John said, "I never thought I'd meet Peter Smith!", where "Peter Smith" is an entity? If so, I would figure out a policy for nested quotes (maybe just ignore them if they're rare) and create an extension that walks through each sentence and marks each token as in quotes or not in quotes (with quotes themselves defined however you want).

    If you care about complex cases I wouldn't use a Matcher for this - it can't handle nesting well, and I think any solution with it would be more complicated than basic state tracking. (If you only care about immediately before and after it should be fine.)
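
For the simple "immediately before and after" case, here is a rough sketch of the getter with the boundary handling added. It also makes two changes the answer above does not spell out, so treat them as my assumptions about the intent: it reads the following token from doc[span.end] (span.end is exclusive), and it compares token.text rather than the Token object to the quote characters. The example Doc is built by hand and the WORK_OF_ART label is only for illustration.

```python
import spacy
from spacy.tokens import Doc, Span

# Assumption: only straight quotes count, as in the question.
QUOTE_CHARS = {'"', "'"}


def is_quoted(span):
    """Return True if the tokens directly before and after the span are quotes."""
    doc = span.doc
    # With no token on one side, the span cannot be enclosed in quotes.
    # This also avoids doc[-1] silently wrapping around to the last token.
    if span.start == 0 or span.end >= len(doc):
        return False
    prev_token = doc[span.start - 1]
    next_token = doc[span.end]  # span.end is exclusive, so this is the next token
    # Compare the token text, not the Token object itself.
    return prev_token.text in QUOTE_CHARS and next_token.text in QUOTE_CHARS


# force=True only so the snippet can be re-run in the same session.
Span.set_extension("is_quoted", getter=is_quoted, force=True)

nlp = spacy.blank("en")
# Words are given explicitly so the example does not depend on tokenizer rules.
doc = Doc(nlp.vocab, words=["He", "read", '"', "Moby", "Dick", '"', "twice"])
doc.ents = [Span(doc, 3, 5, label="WORK_OF_ART")]

for ent in doc.ents:
    print(ent.text, ent._.is_quoted)
```

On the efficiency question: a getter extension is only evaluated when you access ent._.is_quoted, so nothing runs over the rest of the document unless you ask for it.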
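
If you do care about entities that sit somewhere inside a longer quotation, the state tracking could look roughly like the sketch below. The policy here is an assumption chosen to keep it short: only straight double quotes toggle the state (apostrophes are too ambiguous), nesting is ignored, and the state resets at every sentence. The names in_quotes and mark_quoted_tokens are made up for this example, and the sentencizer is just a convenient way to get sentence boundaries without a trained model.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# Hypothetical extension names; force=True only so the snippet can be re-run.
Token.set_extension("in_quotes", default=False, force=True)
Span.set_extension(
    "in_quotes",
    getter=lambda span: all(token._.in_quotes for token in span),
    force=True,
)


def mark_quoted_tokens(doc):
    """Flag every token that lies between a pair of double quotes."""
    for sent in doc.sents:
        inside = False  # policy: quotes do not carry over between sentences
        for token in sent:
            if token.text == '"':
                inside = not inside  # the quote marks themselves stay unflagged
            else:
                token._.in_quotes = inside
    return doc


nlp = spacy.blank("en")
sentencizer = nlp.add_pipe("sentencizer")  # rule-based sentence boundaries

# Words are given explicitly so the example does not depend on tokenizer rules.
words = ["John", "said", ",", '"', "I", "never", "thought", "I", "'d",
         "meet", "Peter", "Smith", "!", '"']
doc = mark_quoted_tokens(sentencizer(Doc(nlp.vocab, words=words)))

peter = Span(doc, 10, 12, label="PERSON")
john = Span(doc, 0, 1, label="PERSON")
print(peter.text, peter._.in_quotes)
print(john.text, john._.in_quotes)
```

This does one pass over the document, after which checking any entity is just a lookup over its own tokens.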