
Does the PhraseMatcher in spaCy still work for wrong tokenization?


https://spacy.io/usage/rule-based-matching#phrasematcher

For this example:

    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.load("en_core_web_sm")
    matcher = PhraseMatcher(nlp.vocab)
    terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
    # Only run nlp.make_doc to speed things up
    patterns = [nlp.make_doc(text) for text in terms]
    matcher.add("TerminologyList", patterns)

    doc = nlp("He lives in Washington, D.C. and Boston.")

The docs say:

Since spaCy is used for processing both the patterns and the text to be matched, you won’t have to worry about specific tokenization – for example, you can simply pass in nlp("Washington, D.C.") and won’t have to write a complex token pattern covering the exact tokenization of the term.

The reason that 'Washington, D.C.' can be matched against the text without worrying about tokenization is that 'Washington, D.C.' is tokenized correctly. But suppose the tokenization were like this:

['in', 'Washington', ',',  'D.', 'C. and', 'Boston', '.']

My question is: if 'C. and' is tokenized as one token, can 'Washington, D.C.' still be matched successfully?


Solution

  • It doesn't matter how Washington, D.C. is tokenized internally as long as the beginning and end of your phrase are token boundaries. In your example, it wouldn't match, because C. and is one token (for some unusual reason?).

    So you also couldn't match Washing or ton D., and you couldn't match D.C (without the final .) if D.C. is one token.
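
    The token-boundary rule can be illustrated with a plain-Python sketch. This is not spaCy's actual implementation (PhraseMatcher works on token hashes internally), but it captures the principle: a phrase matches only if its token sequence appears verbatim as a contiguous run of the document's tokens.

        # Sketch of the matching principle: a pattern matches only when its
        # token sequence lines up exactly with a contiguous slice of doc tokens.
        def phrase_match(pattern_tokens, doc_tokens):
            """Return True if pattern_tokens occurs as a contiguous slice of doc_tokens."""
            n = len(pattern_tokens)
            return any(doc_tokens[i:i + n] == pattern_tokens
                       for i in range(len(doc_tokens) - n + 1))

        pattern = ["Washington", ",", "D.C."]  # how the term is normally tokenized

        # Normal tokenization: pattern tokens line up with doc tokens -> match.
        good = ["He", "lives", "in", "Washington", ",", "D.C.", "and", "Boston", "."]
        print(phrase_match(pattern, good))   # True

        # The hypothetical bad tokenization from the question: "C. and" is one
        # token, so no contiguous slice equals the pattern -> no match.
        bad = ["in", "Washington", ",", "D.", "C. and", "Boston", "."]
        print(phrase_match(pattern, bad))    # False

    In practice this situation rarely arises, because the same tokenizer processes both the pattern (via nlp.make_doc) and the text, so their internal tokenizations agree automatically.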