
spaCy: what is NORM-part of tokenizer_exceptions?


I'm adding tokenizer_exceptions for my language. Looking at the 'gonna' example for English, I wrote the rule as follows:

'т.п.': [
    {ORTH: "т.", NORM: "тому", LEMMA: "тот"},
    {ORTH: "п.", NORM: "подобное", LEMMA: "подобный"}
],

Then when I tokenize, I expect the NORM parts of the rule to appear in token.norm_ (though there is no documentation for Token.norm_). But instead I see the ORTH part in token.norm_, and nowhere in the token instance can I find the NORM part of the rule.

So what is the Token.norm_ member, and what is the NORM part of a tokenizer_exceptions rule for?


Solution

  • To answer the question more generally: In spaCy v1.x, the NORM is mostly used to supply a "normalised" form of a token, for example the full inflected form if the token text is "incomplete" (like in the gonna example), or an alternate spelling. The main purpose of the norm in v1.x is making it accessible as the .norm_ attribute for future reference.
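    As a rough illustration of that behaviour, here's a pure-Python sketch (not spaCy's internals – the table and function names are made up for this example) of how a tokenizer-exception entry supplies per-token attributes, with the norm falling back to the surface form when no NORM is given:

```python
# Hypothetical sketch of a tokenizer-exception lookup; attribute names
# mirror spaCy's ORTH/NORM, but this is NOT spaCy code.
ORTH, NORM = "orth", "norm"

tokenizer_exceptions = {
    "gonna": [
        {ORTH: "gon", NORM: "going"},
        {ORTH: "na", NORM: "to"},
    ],
}

def tokenize(text):
    """Split text on the exception table; norm falls back to orth."""
    pieces = tokenizer_exceptions.get(text, [{ORTH: text}])
    return [
        {"orth": p[ORTH], "norm": p.get(NORM, p[ORTH])}
        for p in pieces
    ]

print(tokenize("gonna"))  # two tokens, each with its normalised form
print(tokenize("cat"))    # no exception: norm equals the surface form
```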

    However, in v2.x, currently in alpha, the NORM attribute becomes more relevant, as it's also used as a feature in the model. This lets you normalise words with different spellings to one common spelling and ensures that those words receive similar representations – even if one of them is less frequent in your training data. Examples of this are American vs. British spelling in English, or the currency symbols, which are all normalised to $. To make this easier, v2.0 introduces a new language data component, the norm exceptions.
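    Conceptually, the norm exceptions can be pictured as a simple lookup table applied on top of the token text. The sketch below is illustrative only – the table contents and function name are assumptions chosen to match the examples above (British vs. American spelling, currency symbols normalised to $), not spaCy's actual data:

```python
# Illustrative norm-exception table (a sketch, not spaCy's real language data).
NORM_EXCEPTIONS = {
    "colour": "color",    # British -> American spelling
    "realise": "realize",
    "£": "$",             # currency symbols all normalised to $
    "€": "$",
}

def norm(token_text):
    """Return the normalised form used as a model feature, or the text itself."""
    return NORM_EXCEPTIONS.get(token_text.lower(), token_text)

print(norm("colour"))  # normalised to the common spelling
print(norm("€"))       # normalised to $
print(norm("dog"))     # no exception: unchanged
```

    Because the model sees the norm rather than the raw text, "colour" and "color" end up with the same feature value, which is what lets the less frequent spelling share the representation of the more frequent one.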

    If you're working on your own language models, I'd definitely recommend checking out v2.0 alpha (which is pretty close to a first release candidate now).