I have a search index with 4 custom analyzers. Two of them are for language-specific searching, and the other two are for "exact" searching (no lemmatization needed). For simplicity, I am including only the configuration for the language-specific custom analyzers, although the overall solution will need to apply to all the custom analyzers.
{
  "tokenizers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
      "name": "text_language_search_custom_analyzer_ms_tokenizer",
      "maxTokenLength": 300,
      "isSearchTokenizer": false,
      "language": "french"
    },
    {
      "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
      "name": "text_language_search_endsWith_custom_analyzer_ms_tokenizer",
      "maxTokenLength": 300,
      "isSearchTokenizer": false,
      "language": "french"
    }
  ],
  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "text_language_search_custom_analyzer",
      "tokenizer": "text_language_search_custom_analyzer_ms_tokenizer",
      "tokenFilters": [
        "lowercase",
        "lang_text_synonym_token_filter",
        "asciifolding"
      ],
      "charFilters": [
        "html_strip"
      ]
    },
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "text_language_search_endsWith_custom_analyzer",
      "tokenizer": "text_language_search_endsWith_custom_analyzer_ms_tokenizer",
      "tokenFilters": [
        "lowercase",
        "lang_text_endsWith_synonym_token_filter",
        "asciifolding",
        "reverse"
      ],
      "charFilters": [
        "html_strip"
      ]
    }
  ]
}
For simplicity, let's assume the index has only 2 searchable fields:
- CategoryLangSearch (uses text_language_search_custom_analyzer)
- CategoryLangSearchEndsWith (uses text_language_search_endsWith_custom_analyzer)
Now assume the index has only 1 document, with the following:
- CategoryLangSearch field value of "TELECOMMUNICATIONS"
- CategoryLangSearchEndsWith field value of "TELECOMMUNICATIONS"
Our UI/API layer has logic so that if the user searches for TELE*, it knows to use CategoryLangSearch as the field to search in. Likewise, our UI/API layer detects when the user searches with an asterisk wildcard at the front. So if the user searches for *TIONS, the UI/API layer is smart enough to search against the CategoryLangSearchEndsWith field instead.
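To make the routing concrete, here is a minimal sketch of what such a layer might do. The function name route_wildcard_query is hypothetical, and the sketch assumes that a leading-asterisk search is sent to the reversed field as the reversed text followed by a trailing asterisk; the real UI/API code is not shown in this post and may differ.

```python
def route_wildcard_query(user_input: str) -> str:
    """Hypothetical helper: map a wildcard search to a fielded Lucene query."""
    term = user_input.strip()
    leading = term.startswith("*")
    trailing = term.endswith("*")
    if leading and trailing:
        # A "contains" search; this simple routing has no field for it.
        raise ValueError("contains-style search is not covered by this routing")
    if leading:
        # Assumption: the EndsWith field stores reversed tokens, so reverse
        # the text and query it as a prefix.
        return "CategoryLangSearchEndsWith:(" + term[1:][::-1] + "*)"
    # Prefix search (or no wildcard at all) goes to the regular field.
    return "CategoryLangSearch:(" + term + ")"


prefix_query = route_wildcard_query("TELE*")
suffix_query = route_wildcard_query("*TIONS")
```

With this sketch, TELE* stays a prefix query on CategoryLangSearch, while *TIONS becomes a prefix query for SNOIT on CategoryLangSearchEndsWith.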
All that is great... it works exactly as intended.
The problem, however, is what we can do if the user searches with *COMMU*, that is, with an asterisk on both sides of the term.
I thought it would be "smart" to build the Azure Search query like this: (CategoryLangSearch:(COMMU*) OR CategoryLangSearchEndsWith:(*UMMOC)) but, in practice, I found that this does not find TELECOMMUNICATIONS ORGANIZATION. This makes perfect sense when I look at the query we build.
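A rough way to see why neither clause matches is to look at the terms themselves. The snippet below is only a simplified picture: it assumes the indexed term is the lowercased word, and it ignores stemming and the synonym filters.

```python
# Simplified view of what each field holds for the sample document.
forward_term = "telecommunications"   # CategoryLangSearch
reversed_term = forward_term[::-1]    # CategoryLangSearchEndsWith ("reverse" filter)

# COMMU* is a prefix query: the term would have to start with "commu".
forward_prefix_hit = forward_term.startswith("commu")

# On the reversed field, "ummoc" sits neither at the start nor at the end.
reversed_prefix_hit = reversed_term.startswith("ummoc")
reversed_suffix_hit = reversed_term.endswith("ummoc")

# The text is present in both terms, but only in the middle.
contained = "commu" in forward_term and "ummoc" in reversed_term

print(forward_prefix_hit, reversed_prefix_hit, reversed_suffix_hit, contained)
```

Both fields only help when the searched text is anchored to one end of the term, which is exactly what a "contains" search does not guarantee.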
So my question is: how do we pull this off? Can we pull it off in Azure Search in any way, shape or form? I don't see a path to success for this one. The only possible solution I can see is the following:
1. The user searches for something.
2. We first query our MS SQL Server directly, using the %something% syntax that SQL supports.
3. We find the IDs that match, and then use those to search against the Azure Search index.
There are two ways you can issue a 'contains' search in Azure Search.
The first approach is to use a regular expression in the Lucene query syntax. In your example, if you issue the regex query /.*COMMU.*/, the search query will first expand to all terms in the search index that contain the string 'commu' and then find the matching documents. You can issue the regex query against the field for "exact" matches. The search query would look like: docs?search=exact_field:/.*COMMU.*/&queryType=full.
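As an illustration, the request could be assembled like this. The helper name build_contains_query is hypothetical; the sketch lowercases the input on the assumption that the indexed terms are lowercased, and it uses Python's re.escape to neutralize regex metacharacters, which only approximates Lucene's own escaping rules.

```python
import re
from urllib.parse import urlencode


def build_contains_query(field: str, text: str) -> str:
    """Hypothetical helper: build a full-Lucene regex 'contains' request path."""
    # Escape metacharacters so the user's text is matched literally.
    literal = re.escape(text.lower())
    pattern = "/.*" + literal + ".*/"
    params = {"search": field + ":" + pattern, "queryType": "full"}
    # urlencode percent-encodes ':', '/' and '*' in the search expression.
    return "docs?" + urlencode(params)


contains_url = build_contains_query("exact_field", "COMMU")
```

The resulting path is the URL-encoded form of docs?search=exact_field:/.*commu.*/&queryType=full.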
The approach above is recommended only if you have a small index, because the query expansion process that finds the queried pattern is costly, especially for broad searches like /.*a.*/. The second approach moves that work to indexing time by using an ngram token filter. The configuration for the token filter looks like this:
{
  "@odata.type": "#Microsoft.Azure.Search.NGramTokenFilterV2",
  "name": "ngram_tokenfilter",
  "minGram": 1,
  "maxGram": 100
}
Given a text "hello" for example, this tokenfilter generates ngram tokens as
h, e, l, l, o, he, el, ll, lo, hel, ell, ..., hello.
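The expansion can be reproduced with a few lines of Python. This is only a sketch of the idea, not the service's implementation: the ngrams function is hypothetical, and the order in which the real token filter emits its tokens may differ.

```python
from typing import List


def ngrams(token: str, min_gram: int, max_gram: int) -> List[str]:
    """Return every substring of token whose length is within [min_gram, max_gram]."""
    grams = []
    longest = min(max_gram, len(token))
    for size in range(min_gram, longest + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


tokens = ngrams("hello", 1, 100)
print(tokens)
print("ell" in tokens)
```

Because every substring becomes its own indexed term, a plain term lookup for "ell" is enough to find "hello".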
When querying against the new field analyzed with the ngram token filter, you do not need a wildcard or regex operator, and can use a regular term search instead. The search query "docs?search=ell" will find the document containing the term "hello". This approach avoids the expensive expansion process because all the "contains" possibilities have been preprocessed and already exist in the index. Please note that you need the ngram analysis at indexing time only.
Please also note that this ngram analysis impacts the size of the index, as it produces more tokens. You can use the parameters 'minGram' and 'maxGram' to control the size of the index.
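To get a feel for the trade-off, the number of tokens produced for a single word can be counted directly. The function below is a back-of-the-envelope sketch (ngram_count is a hypothetical name), and the 3 to 10 range is just an example setting, not a recommendation from the service.

```python
def ngram_count(token_length: int, min_gram: int, max_gram: int) -> int:
    """Count the ngrams produced for one token of the given length."""
    longest = min(max_gram, token_length)
    # A token of length n has (n - size + 1) substrings of each size.
    return sum(token_length - size + 1 for size in range(min_gram, longest + 1))


word_length = len("TELECOMMUNICATIONS")       # 18 characters
wide_range = ngram_count(word_length, 1, 100)  # every substring is indexed
narrow_range = ngram_count(word_length, 3, 10)
print(wide_range, narrow_range)  # 171 100
```

Narrowing the range shrinks the index, but a plain term search can then only match text whose length falls inside that range, so very short or very long search strings would no longer hit.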
Since you already have an API/UI that directs the search based on the positions of '*', the second option seems like a good approach.
Nate