Tags: python, tokenize, spacy, prefix

Can I apply custom token rules to tokens split by prefixes in spaCy?


I customized a spaCy Tokenizer with additional rules and prefixes to treat w/ and f/ as with and for, respectively. The prefixes correctly split them off, but the custom rules for lemmas and norms are not being applied in that case.

Here's an excerpt of the code.

import spacy
from spacy.symbols import ORTH, LEMMA, NORM
from spacy.tokenizer import Tokenizer


def create_tokenizer(nlp):
    # Start from the default exceptions and add custom entries for w/ and f/.
    rules = dict(nlp.Defaults.tokenizer_exceptions)
    rules.update({
        'w/': [{ORTH: 'w/', LEMMA: 'with', NORM: 'with'}],
        'W/': [{ORTH: 'W/', LEMMA: 'with', NORM: 'with'}],
        'f/': [{ORTH: 'f/', LEMMA: 'for', NORM: 'for'}],
        'F/': [{ORTH: 'F/', LEMMA: 'for', NORM: 'for'}],
    })

    # Also split w/ and f/ off the front of a token as prefixes.
    custom_prefixes = (
        r"[wW]/",
        r"[fF]/",
    )

    prefix_re = spacy.util.compile_prefix_regex(nlp.Defaults.prefixes + custom_prefixes)

    return Tokenizer(
        nlp.vocab,
        rules=rules,
        prefix_search=prefix_re.search,
    )

Here's the result.

>>> doc = nlp("This w/ that")
>>> doc[1]
w/
>>> doc[1].norm_
'with'
>>> doc = nlp("This w/that")
>>> doc[1]
w/
>>> doc[1].norm_
'w/'

In the case of "This w/that", the w/ is split off, but the custom rules aren't applied to it (i.e., its NORM is w/ instead of with). What do I need to do to have the custom rules applied to tokens split off by prefixes/infixes/suffixes?


Solution

  • Unfortunately there's no way to have prefixes and suffixes also analyzed as exceptions in spaCy v2. Tokenizer exceptions will be handled more generally in the upcoming spaCy v3 release in order to support cases like this, but I don't know at this point when that release will be.

    I think the best you can do in spaCy v2 is to add a quick post-processing component that assigns the lemmas/norms to the individual tokens if they match the orth pattern, along the lines of the sketch below.
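
    A minimal sketch of such a component, assuming spaCy v2; the mapping, the function name assign_special_norms, the model name, and the pipeline placement are illustrative choices, not something the answer prescribes:

    import spacy

    nlp = spacy.load('en_core_web_sm')     # assumption: any spaCy v2 pipeline
    nlp.tokenizer = create_tokenizer(nlp)  # the custom tokenizer from the question

    # Map the final token text (lowercased) to the norm/lemma it should get.
    SPECIAL_CASES = {'w/': 'with', 'f/': 'for'}

    def assign_special_norms(doc):
        for token in doc:
            replacement = SPECIAL_CASES.get(token.lower_)
            if replacement is not None:
                token.norm_ = replacement
                token.lemma_ = replacement
        return doc

    # Append it to the end of the pipeline so the tagger/lemmatizer
    # can't overwrite the values afterwards.
    nlp.add_pipe(assign_special_norms, last=True)

    Because the component only looks at the final token text, it doesn't care whether a token came from an exception or was split off by a prefix rule, so doc[1].norm_ for "This w/that" should come out as 'with'.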