We are now using the Azure Search Microsoft language analyzers on some of our language-specific fields. In most cases they give better relevance than the standard Lucene language analyzers, but we found an issue while verifying the en.microsoft analyzer.
The problem shows up when a field value contains digits: the analyzer is smart enough to treat a number with redundant leading “0”s as equivalent to the number itself.
For example:
POST /analyze?api-version=2017-11-11
{
  "text": "1",
  "analyzer": "en.microsoft"
}
The response is:
"tokens": [
{
"token": "1",
"startOffset": 0,
"endOffset": 2,
"position": 0
},
{
"token": "nn1",
"startOffset": 0,
"endOffset": 2,
"position": 0
}
]
As a result, if the field value is “01”, then any query text like “01”, “001”, “0001”, … will match that field.
We have a field that stores product attribute name/value pairs, for example “brand:Contoso|size:1”. Even a search for “0001” can return the document with this field value, which is not what we want.
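You can reproduce this with the Analyze API; given the normalization shown above, we would expect “0001” to yield the same normalized token “nn1” that “1” does:

POST /analyze?api-version=2017-11-11
{
  "text": "0001",
  "analyzer": "en.microsoft"
}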
So my question is: is there any way to customize the en.microsoft analyzer so that we can take advantage of its powerful stemmer while avoiding the automatic “0” padding in front of digits?
Unfortunately, you can't change how the Microsoft tokenizers normalize numbers. To work around this limitation, you could either choose a different analyzer for the product attributes field, or add a character filter to your analyzer configuration that encodes the numeric characters so the tokenizer ignores them; for example, map each digit to a character from outside your expected character set using the MappingCharFilter. You can find examples here; use MicrosoftLanguageStemmingTokenizer as your tokenizer.
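Here is a minimal sketch of what such an index definition could look like. The field, analyzer, char filter, and tokenizer names are all hypothetical, the digits are mapped to Private Use Area characters as one possible “outside your character set” choice, and only the first three digits are shown for brevity (a real configuration would map all ten):

{
  "name": "products",
  "fields": [
    {
      "name": "attributes",
      "type": "Edm.String",
      "searchable": true,
      "analyzer": "en_no_number_normalization"
    }
  ],
  "analyzers": [
    {
      "name": "en_no_number_normalization",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "charFilters": [ "digits_to_private_chars" ],
      "tokenizer": "en_ms_stemming_tokenizer",
      "tokenFilters": [ "lowercase" ]
    }
  ],
  "charFilters": [
    {
      "name": "digits_to_private_chars",
      "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
      "mappings": [ "0=>\uE000", "1=>\uE001", "2=>\uE002" ]
    }
  ],
  "tokenizers": [
    {
      "name": "en_ms_stemming_tokenizer",
      "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
      "language": "english"
    }
  ]
}

Because the digits are rewritten before tokenization, they never reach the tokenizer as numbers, so no number normalization (and no leading-zero equivalence) is applied. It is worth running the Analyze API against the custom analyzer to confirm how the mapped characters are tokenized before relying on this in production.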