java elasticsearch aws-elasticsearch elasticsearch-java-api elasticsearch-7

Elasticsearch MatchQuery is returning wrong results

I am using a matchQuery to query Elasticsearch in Java. Below is my query:

sourceBuilder.query(QueryBuilders.matchQuery("TransactionId_s","BulkRunTest.20Nov20201446.00"));

The field TransactionId_s is not a keyword field. I expect the matchQuery to match the exact string I have given and return only those results. There should be no documents in Elasticsearch with TransactionId_s equal to BulkRunTest.20Nov20201446.00, but I am getting some results, and they have TransactionId_s values like these:

"TransactionId_s" : "BulkRunTest.17Sep20201222.00"
"TransactionId_s" : "BulkRunTest.22Sep20201450.00"
"TransactionId_s" : "BulkRunTest.20Sep20201250.00"

When I tried using a termQuery instead of matchQuery, I got 0 results, which is the expected result. I thought matchQuery would allow me to query any field for the given value without having to worry about tokenization. Am I wrong? And how do I resolve the issue I am seeing?

Any help would be much appreciated. Thank you.

Solution

Match queries are analyzed, i.e. the search term goes through the same analyzer that was applied to the field at index time. You can use the Analyze API to see the tokens generated for both the indexed values and the search term.

Assuming TransactionId_s is a text field with the default (standard) analyzer, the search term BulkRunTest.20Nov20201446.00 produces the tokens below:

POST /_analyze
{
    "analyzer" : "standard",
    "text" : "BulkRunTest.20Nov20201446.00"
}

And the generated tokens:

{
    "tokens": [
        {
            "token": "bulkruntest", // notice this token
            "start_offset": 0,
            "end_offset": 11,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "20nov20201446.00",
            "start_offset": 12,
            "end_offset": 28,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}

Now let's look at the tokens for one of the matched documents, BulkRunTest.17Sep20201222.00:

POST /_analyze
{
    "analyzer" : "standard",
    "text" : "BulkRunTest.17Sep20201222.00"
}

And the generated tokens:

{
    "tokens": [
        {
            "token": "bulkruntest", // notice same token 
            "start_offset": 0,
            "end_offset": 11,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "17sep20201222.00",
            "start_offset": 12,
            "end_offset": 28,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}

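As you can see, the token bulkruntest appears in both the indexed document and the search term. A match query uses the OR operator by default, so a document matches as soon as it shares any one token with the query. That is why this document was returned, and the same applies to the other documents in your results.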
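The token overlap described above can be reproduced in a few lines of plain Java. This is only a minimal sketch, not Lucene's tokenizer: analyze is a hypothetical stand-in that approximates the standard analyzer for IDs shaped like the ones in the question, and matchesWithOr assumes the match query's default OR operator.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class MatchQuerySketch {

    // Rough approximation of the standard analyzer (an assumption, valid for these IDs only):
    // lowercases the text, splits on anything that is not a letter or digit, and keeps a '.'
    // only when it sits between two digits or between two letters.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean keep = Character.isLetterOrDigit(c);
            if (c == '.' && i > 0 && i + 1 < s.length()) {
                char prev = s.charAt(i - 1);
                char next = s.charAt(i + 1);
                keep = (Character.isDigit(prev) && Character.isDigit(next))
                        || (Character.isLetter(prev) && Character.isLetter(next));
            }
            if (keep) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // With the OR operator, one shared token between query and document is enough for a hit.
    static boolean matchesWithOr(String query, String indexedValue) {
        Set<String> docTokens = new HashSet<>(analyze(indexedValue));
        for (String token : analyze(query)) {
            if (docTokens.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String query = "BulkRunTest.20Nov20201446.00";
        String[] returned = {
            "BulkRunTest.17Sep20201222.00",
            "BulkRunTest.22Sep20201450.00",
            "BulkRunTest.20Sep20201250.00"
        };
        System.out.println("query tokens: " + analyze(query));
        for (String doc : returned) {
            System.out.println(doc + " -> " + analyze(doc) + " match=" + matchesWithOr(query, doc));
        }
    }
}
```

Each of the three IDs from the question shares the bulkruntest token with the query, so each one counts as a match under OR semantics even though the date parts differ.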

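If you used the default auto-generated mapping and therefore have a .keyword subfield, you can run a term query against TransactionId_s.keyword for an exact match. With the Java API from your question, that would be QueryBuilders.termQuery("TransactionId_s.keyword", "BulkRunTest.20Nov20201446.00").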

Working example

{
  "query": {
    "term": {   // term query
      "TransactionId_s.keyword": {   // .keyword subfield is used
        "value": "BulkRunTest.20Nov20201446.00"
      }
    }
  }
}

And the search result, from a test index that contains this document:

"hits": [
            {
                "_index": "test_in",
                "_type": "_doc",
                "_id": "2",
                "_score": 0.6931471,
                "_source": {
                    "TransactionId_s": "BulkRunTest.20Nov20201446.00"
                }
            }
        ]
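
For comparison, here is a rough Java model of why the term query only works on the .keyword subfield. It is a sketch under assumptions, not Elasticsearch code: TermQuerySketch and termQuery are hypothetical names, the text-field terms are copied from the _analyze output above, and the keyword subfield is assumed to store the original string unchanged as a single term.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class TermQuerySketch {

    // A term query does not analyze its input: it looks for the exact value
    // among the terms stored for the field.
    static boolean termQuery(Set<String> indexedTerms, String value) {
        return indexedTerms.contains(value);
    }

    public static void main(String[] args) {
        String id = "BulkRunTest.20Nov20201446.00";

        // Terms a text field with the standard analyzer would hold for this document.
        Set<String> textFieldTerms = new HashSet<>(Arrays.asList("bulkruntest", "20nov20201446.00"));
        // The .keyword subfield keeps the whole original string as one term.
        Set<String> keywordFieldTerms = Collections.singleton(id);

        System.out.println("term on text field:    " + termQuery(textFieldTerms, id));
        System.out.println("term on keyword field: " + termQuery(keywordFieldTerms, id));
        System.out.println("other ID on keyword:   "
                + termQuery(keywordFieldTerms, "BulkRunTest.17Sep20201222.00"));
    }
}
```

Under this model, a term query against the text field finds nothing even when the document exists, because the unanalyzed string is not one of the stored tokens. So the 0 results from termQuery on TransactionId_s do not by themselves prove the document is absent; the .keyword subfield is the place to check.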