Tags: python, nlp, spacy

spaCy tokenizer is not recognizing the period as a suffix consistently


I have been working on a custom NER model to extract products that have strange identifiers that I can't control.

You can see from the example below that in some cases it isn't picking up the period as a suffix. I have already added a custom infix pattern to handle products with hyphens (shown further down). What do I need to add to handle this case without breaking the rest of the existing tokenization? Any input would be appreciated.

issue_text = "I really like stereo receivers, I want to buy the new ASX8E11F." 
print(nlp_custom_ner.tokenizer.explain(issue_text))

issue_text = "I really like stereo receivers, I want to buy the new RK8BX." 
print(nlp_custom_ner.tokenizer.explain(issue_text))

Output

[('TOKEN', 'I'), ('TOKEN', 'really'), ('TOKEN', 'like'), ('TOKEN', 'stereo'), ('TOKEN', 'receivers'), ('SUFFIX', ','), ('TOKEN', 'I'), ('TOKEN', 'want'), ('TOKEN', 'to'), ('TOKEN', 'buy'), ('TOKEN', 'the'), ('TOKEN', 'new'), ('TOKEN', 'ASX8E11F.')]

[('TOKEN', 'I'), ('TOKEN', 'really'), ('TOKEN', 'like'), ('TOKEN', 'stereo'), ('TOKEN', 'receivers'), ('SUFFIX', ','), ('TOKEN', 'I'), ('TOKEN', 'want'), ('TOKEN', 'to'), ('TOKEN', 'buy'), ('TOKEN', 'the'), ('TOKEN', 'new'), ('TOKEN', 'RK8BX'), ('SUFFIX', '.')]

I added custom infix patterns to handle products with hyphens, and that part is working.

import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

# Default tokenizer
nlp = spacy.load("en_core_web_sm")
doc = nlp("AXDR-PXXT-001")
print([t.text for t in doc])

# Modify tokenizer infix patterns
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # ✅ Commented out regex that splits on hyphens between letters:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)

infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp("AXDR-PXXT-001")
print([t.text for t in doc])

Output

['AXDR', '-', 'PXXT-001']
['AXDR-PXXT-001']

Solution

  • Similar to the infix example above, you need to look at the current suffix patterns and edit the rule that is responsible for splitting off this suffix (see the sketch after the rule below).

    For this particular case, it's probably this rule from the general suffix rules:

    r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER)