azure-cognitive-search

How do I add a char filter to a Microsoft language analyzer in Azure Search?


We want to use the language-specific analyzers provided by Azure Search, but add the html_strip char filter from Lucene. Our idea was to build a custom analyzer that uses the same components (tokenizer, filters) as, for example, the en.microsoft analyzer, plus the additional char filter.

Sadly we can't find any documentation on what exactly constitutes the en.microsoft analyzer or any other Microsoft analyzer. We do not know which tokenizers or filters to use to get the same result with a custom analyzer.

Can anyone point us to the right documentation?

The documentation says that the en.microsoft analyzer performs lemmatization instead of stemming, but I can't find any tokenizer or filter that claims to use lemmatization, only stemmers.


Solution

  • To create a customized version of a Microsoft analyzer, start with the Microsoft tokenizer for a given language (we have a stemming and non-stemming version), and add token filters from the set of available token filters to customize the output token stream. Note that the stemming tokenizer also does lemmatization, depending on the language.

    In most cases, a Microsoft language analyzer is a Microsoft tokenizer plus a stopwords token filter and a lowercase token filter, but this varies depending on the language. In some cases we also apply language-specific character normalization.

    We recommend using the above as a starting point. The Analyze API can then be used for testing your configuration to see if it gives you the results you want.
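
As a rough illustration of the starting point described above, here is a minimal Python sketch that builds the analysis part of an index definition and a request body for the Analyze API. It assumes English, the Microsoft stemming tokenizer, a lowercase filter, an English stopwords filter, and the predefined `html_strip` char filter. The names `en_ms_stemming_tokenizer`, `en_stopwords`, and `en_microsoft_html` are made up for this example, and the component mix is only the "in most cases" approximation from the answer, not a confirmed definition of en.microsoft.

```python
import json

# Hypothetical names chosen for this sketch; pick your own.
TOKENIZER_NAME = "en_ms_stemming_tokenizer"
STOPWORDS_FILTER_NAME = "en_stopwords"
ANALYZER_NAME = "en_microsoft_html"


def build_analysis_section():
    """Return the analyzer-related parts of an index definition as a dict.

    Assumption: tokenizer + lowercase + stopwords approximates en.microsoft.
    Verify the output with the Analyze API before relying on it.
    """
    return {
        "tokenizers": [
            {
                "name": TOKENIZER_NAME,
                "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
                "language": "english",
                # False = tokenizer behaves as the indexing tokenizer.
                "isSearchTokenizer": False,
            }
        ],
        "tokenFilters": [
            {
                "name": STOPWORDS_FILTER_NAME,
                "@odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter",
                "stopwordsList": "english",
            }
        ],
        "analyzers": [
            {
                "name": ANALYZER_NAME,
                "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
                # html_strip is predefined, so it is referenced by name only.
                "charFilters": ["html_strip"],
                "tokenizer": TOKENIZER_NAME,
                "tokenFilters": ["lowercase", STOPWORDS_FILTER_NAME],
            }
        ],
    }


def build_analyze_request(text, analyzer_name):
    """Return the JSON body for a POST to the index's analyze endpoint."""
    return {"text": text, "analyzer": analyzer_name}


if __name__ == "__main__":
    sample = "<p>The foxes were running</p>"
    # Merge this section into the full index definition before creating the index.
    print(json.dumps(build_analysis_section(), indent=2))
    # Send both bodies to the Analyze API and compare the returned tokens.
    print(json.dumps(build_analyze_request(sample, ANALYZER_NAME)))
    print(json.dumps(build_analyze_request(sample, "en.microsoft")))
```

Sending the same text once with the custom analyzer and once with `en.microsoft` makes it easy to spot where the token streams differ, and then to add or remove token filters until they match closely enough.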