azure-cognitive-search

Noise / stop words in the search query remove correct search results


We have documents in different languages. To search across them, we created an index with one field per language, and we fill only the field that matches the language of each document (the other fields stay empty). We do not know which language a user searches in, so we search all fields to make sure the applicable field is always included.

We have an issue when users supply a search query containing noise/stop words. Perfectly valid results appear to be removed from the result set when we use searchMode=all together with a language analyzer. For instance, we have the following text in our index to test this behavior: A document title with the and it in the name

When we use the following search query we get the expected search result: search=document title name&QueryType=full&searchMode=all&$count=true

However, when we search for the exact title (or even add just a few of the noise words such as with, the and in), no results are returned when we use the en.microsoft analyzer. When we use another language analyzer (which uses different noise/stop words), the results are returned. We see similar behavior with the nl.microsoft analyzer when we search a Dutch index for text that contains Dutch noise/stop words such as "bij", "in" or "en", even though these words are part of the indexed text.

Is there some way to resolve this issue? Is this a bug in the search when using language analyzers? I would assume that if a query searches an index from which noise/stop words were filtered, Cognitive Search would also remove those noise/stop words from the query before executing it.

Note: We also found the following Stack Overflow post: Queries with stopwords and searchMode=all return no results. It seems the issue only occurs when we search multiple fields with different languages. I can confirm this. If I test by searching only the English field with the following query, we get the expected result: search=document title name&QueryType=full&searchMode=all&searchFields=Title_enus&$count=true

However, when I search two fields, one using an English analyzer and one using a Dutch analyzer, I no longer get the English result: search=document title name&QueryType=full&searchMode=all&searchFields=Title_enus,Title_nlnl&$count=true

Our actual situation is slightly different from the one in that post, since we search multiple fields using an OR clause. I'll update this post once I have done more testing and can provide the exact test queries including their results.


Solution

  • Using an OR query will work. You can also use searchMode=any. From what I understand, your content is multi-language, with multiple languages per record.

    INDEX

        "fields": [
        {"name": "Id", "type": "Edm.String", "searchable": false, "key": true, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null, "synonymMaps": [] }, 
        {"name": "Title_enus", "type": "Edm.String", "searchable": true, "analyzer":"en.microsoft"}, 
        {"name": "Title_nlnl", "type": "Edm.String", "searchable": true, "analyzer":"nl.microsoft"}],  
    

    CONTENT

    Using the example content from the article you link to, with your index definition.

        "value": [
        {
            "@search.action": "mergeOrUpload",
            "Id": "1",
            "Title_enus": "Waiting for a bus",
            "Title_nlnl": "Wachten op een bus"
        },
        {
            "@search.action": "mergeOrUpload",
            "Id": "2",
            "Title_enus": "Run to the hills",
            "Title_nlnl": "Ren naar de heuvels"
        }
        ]
    

    You don't know what language the end user's input is in. Whatever the input is, you insert it into a query you have prepared. Consider the following examples:

    1. search=wait for&$count=true&searchMode=all&queryType=full
    2. search=wait for&$count=true&searchMode=all&queryType=full&searchFields=Title_enus
    3. search=wait for&$count=true&searchMode=all&queryType=full&searchFields=Title_enus,Title_nlnl
    4. search=Title_enus:"wait for" OR Title_nlnl:"wait for"&$count=true&searchMode=all&queryType=full
    5. search=wait for&$count=true&searchMode=any&queryType=full

    In scenario 1, you search across both of your searchable fields, and searchMode=all dictates that your terms wait for must be matched in both. Since they are not matched in the Title_nlnl field, the record is not a match.

    In scenario 2, I specify that I only want to search within the Title_enus field. This is a match because wait matches. The term for is a stopword and is thus ignored. I know this scenario won't work for you, since you want to enable users to search in all languages across all content. Nevertheless, it helps our understanding.

    In scenario 3, we search both Title_enus and Title_nlnl. This is effectively the same as scenario 1. For a record to match, the search terms must match both Title_enus AND Title_nlnl. There is no wait for in Title_nlnl (note that for is not removed as a stopword there, but that does not matter here).

    In scenario 4, we use an actual OR query. You take the user's input, and your requirement is that it must match in either Title_enus OR Title_nlnl. Here, you get record 1 as a match, as expected:

            "Id": "1",
            "Title_enus": "Waiting for a bus",
            "Title_nlnl": "Wachten op een bus"
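
    Since the user's input has to be inserted into a prepared query, the scenario 4 expression can be generated from a list of field names. The following is only a minimal sketch: the helper names quote_phrase and build_or_query are made up for illustration, and it assumes that inside a quoted phrase only the double quote and the backslash need escaping. Verify the escaping rules against the full Lucene syntax documentation before relying on it.

```python
def quote_phrase(user_input: str) -> str:
    """Wrap raw user input in double quotes for a Lucene phrase query.

    Assumption: within a quoted phrase, only the backslash and the double
    quote need to be escaped so the input cannot break out of the phrase.
    """
    escaped = user_input.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_or_query(user_input: str, fields: list[str]) -> str:
    """Build a fielded OR expression: field1:"input" OR field2:"input" ..."""
    phrase = quote_phrase(user_input)
    return " OR ".join(f"{field}:{phrase}" for field in fields)


# Hypothetical usage with the two fields from the index definition.
query = build_or_query("wait for", ["Title_enus", "Title_nlnl"])
print(query)  # Title_enus:"wait for" OR Title_nlnl:"wait for"
```

    The resulting string goes into the search parameter with queryType=full and still needs to be URL-encoded when it is sent as part of a GET request.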
    

    In scenario 5, we use any-mode. This makes the search syntax simpler, and it returns the same result as scenario 4.
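
    If you send the query as a POST request instead of a query string, the any-mode variant might be assembled roughly as follows. This is a sketch, not a definitive implementation: build_any_mode_request is a hypothetical helper, and the property names are the JSON counterparts of the query-string parameters used above, so check them against the Search Documents REST reference for your API version.

```python
import json


def build_any_mode_request(user_input, fields=None):
    """Return a request body for a searchMode=any query.

    `fields` is optional; when it is omitted, no searchFields property is
    sent, which corresponds to scenario 5 (no searchFields parameter).
    """
    body = {
        "search": user_input,   # raw user input, parsed as full Lucene syntax
        "searchMode": "any",    # a record matches if any term matches
        "queryType": "full",
        "count": True,          # counterpart of $count=true
    }
    if fields:
        # Assumption: searchFields is a comma-separated string in the body.
        body["searchFields"] = ",".join(fields)
    return body


body = build_any_mode_request("wait for", ["Title_enus", "Title_nlnl"])
print(json.dumps(body, indent=2))
```

    Keep in mind that any-mode makes every term optional, so longer queries tend to return more, and less precise, results than an OR across fields combined with all-mode.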