java, elasticsearch, elasticsearch-java-api

Elasticsearch must_not not working with filter clause


I am working on a medical project where I have multiple questions, each with topics attached to it. The following query runs fine, but it does not honor the 'must_not' filter, whereas the 'must' clause works as expected. How can I fix this?

GET stopdata/_search
{
  "query": {
    "function_score": {
      "query": {
        "filtered": {
          "query": {
            "match": {
              "question": "Hello Dr. Iam suffering from fever with cough nd cold since 3 days"
            }
          }
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "terms": {
                "topics": [
                  "fever",
                  "cough"
                ]
              }
            }
          ],
          "must_not": [
            {
              "terms": {
                "topics": [
                  "children",
                  "child",
                  "childrens health"
                ]
              }
            }
          ]
        }
      },
      "random_score": {}
    }
  },
  "highlight": {
    "fields": {
      "keyword": {}
    }
  }
}

I also need to convert the query to Java. This is what I have tried so far, but I am stuck:

Set<String> mustNot = new HashSet<String>();
mustNot.add("child");
mustNot.add("children");
mustNot.add("childrens health");

Set<String> must = new HashSet<String>();
must.add("fever");
must.add("cough");

FunctionScoreQueryBuilder fsqb = new FunctionScoreQueryBuilder(QueryBuilders.matchQuery("question", "Hello Dr. Iam suffering from fever with cough nd cold since 3 days"));
fsqb.add(ScoreFunctionBuilders.randomFunction((new Date()).getTime()));

BoolQueryBuilder bqb = boolQuery()
        .mustNot(termsQuery("topics", mustNot));

SearchResponse response1 = client.prepareSearch("stopdata")
        .setQuery(fsqb)
        .execute()
        .actionGet();

System.out.println(response1.getHits().getTotalHits());

The mapping of the 'stopdata' index is as follows:

{
   "stopdata": {
      "mappings": {
         "questions": {
            "properties": {
               "answers": {
                  "type": "string"
               },
               "id": {
                  "type": "long"
               },
               "question": {
                  "type": "string",
                  "analyzer": "my_english"
               },
               "relevantQuestions": {
                  "type": "long"
               },
               "topics": {
                  "type": "string"
               }
            }
         }
      }
   }
}

Here is some sample data from the above index:

"question": "My son of age 8 months is suffering from cough and cold and fever. What treatment I have to follow?"
"topics": [
  "Cough",
  "Fever",
  "Hydration",
  "Nutrition",
  "Tens",
  "Childrens health"
]

"question": "Hi.My daughter, 4 years old , has on and of fever  with severe coughing and colds for 3 days now.She vomited as well last night.Do you think it's viral?"
"topics": [
  "Vomiting",
  "Flu",
  "Cough",
  "Fever",
  "Pneumonia",
  "Meningitis",
  "Tamiflu",
  "Incision",
  "Childrens health",
  "Oseltamivir"
]

"question": "If you have a fever of 101 with chills and sweats for 2 day with a slight cough, should you go to the drs or let is wear off?"
"topics": [
  "Cough",
  "Fever"
]

Solution

  • What I see is that the whole filter part is misplaced. It should go inside the filtered query, because there is no filter element at the root of the function_score element (see the official docs). So your query should look like this in the first place. You should also use POST instead of GET, since you're sending a payload:

    POST stopdata/_search
    {
      "query": {
        "function_score": {
          "query": {
            "filtered": {
              "query": {
                "match": {
                  "question": "Hello Dr. Iam suffering from fever with cough nd cold since 3 days"
                }
              },
              "filter": {
                "bool": {
                  "must": [
                    {
                      "terms": {
                        "topics": [
                          "fever",
                          "cough"
                        ]
                      }
                    }
                  ],
                  "must_not": [
                    {
                      "terms": {
                        "topics": [
                          "children",
                          "child",
                          "childrens health"
                        ]
                      }
                    }
                  ]
                }
              }
            }
          },
          "random_score": {}
        }
      },
      "highlight": {
        "fields": {
          "keyword": {}
        }
      }
    }
    

    Now to write all this in Java, it goes like this:

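    import java.util.Date;
    import java.util.HashSet;
    import java.util.Set;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.index.query.BoolFilterBuilder;
    import org.elasticsearch.index.query.FilterBuilders;
    import org.elasticsearch.index.query.FilteredQueryBuilder;
    import org.elasticsearch.index.query.MatchQueryBuilder;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.index.query.functionscore.FunctionScoreQueryBuilder;
    import org.elasticsearch.index.query.functionscore.ScoreFunctionBuilders;

    // Topics that must be absent from a matching document
    Set<String> mustNot = new HashSet<String>();
    mustNot.add("child");
    mustNot.add("children");
    mustNot.add("childrens health");

    // Topics of which at least one must be present
    Set<String> must = new HashSet<String>();
    must.add("fever");
    must.add("cough");

    // Full-text part of the search
    MatchQueryBuilder query = QueryBuilders.matchQuery("question", "Hello Dr. Iam suffering from fever with cough nd cold since 3 days");

    // The bool filter carries both the must and the must_not clauses
    BoolFilterBuilder filter = FilterBuilders.boolFilter()
        .must(FilterBuilders.termsFilter("topics", must))
        .mustNot(FilterBuilders.termsFilter("topics", mustNot));

    // The filter goes inside the filtered query, not at the root of function_score
    FilteredQueryBuilder fqb = QueryBuilders.filteredQuery(query, filter);

    // Wrap the filtered query and score the hits randomly, seeded with the current time
    FunctionScoreQueryBuilder fsqb = QueryBuilders.functionScoreQuery(fqb);
    fsqb.add(ScoreFunctionBuilders.randomFunction((new Date()).getTime()));

    SearchResponse response1 = client.prepareSearch("stopdata")
            .setQuery(fsqb)
            .execute()
            .actionGet();

    System.out.println(response1.getHits().getTotalHits());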
    

    UPDATE

    The reason must_not doesn't exclude Childrens health is that the topics field is analyzed: Childrens health is tokenized and lowercased into two tokens, childrens and health. A terms filter compares each listed term against the indexed tokens without analyzing it, so the term childrens health never matches anything. Splitting it into two terms may help, with the caveat that health on its own will also exclude any document whose topics contain that word:

                  "must_not": [
                    {
                      "terms": {
                        "topics": [
                          "children",
                          "child",
                          "childrens", 
                          "health"
                        ]
                      }
                    }
                  ]
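
To make the tokenization point concrete, here is a minimal sketch in plain Java that imitates the behaviour without talking to Elasticsearch. The class TopicFilterSketch and its methods analyze, termsMatch and passes are hypothetical names made up for this illustration, and the lowercase-and-split logic is only a rough approximation of the default analyzer, not its actual implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the Elasticsearch API.
final class TopicFilterSketch {

    // Rough stand-in for the default analyzer (an assumption): lowercase,
    // then split on anything that is not an ASCII letter or digit.
    static Set<String> analyze(List<String> topics) {
        Set<String> tokens = new HashSet<String>();
        for (String topic : topics) {
            for (String token : topic.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
            }
        }
        return tokens;
    }

    // A terms clause matches when at least one listed term equals an indexed token.
    // The terms themselves are compared as-is, without being analyzed.
    static boolean termsMatch(Set<String> indexedTokens, List<String> terms) {
        for (String term : terms) {
            if (indexedTokens.contains(term)) {
                return true;
            }
        }
        return false;
    }

    // bool filter with one terms clause under must and one under must_not.
    static boolean passes(List<String> topics, List<String> must, List<String> mustNot) {
        Set<String> tokens = analyze(topics);
        return termsMatch(tokens, must) && !termsMatch(tokens, mustNot);
    }

    public static void main(String[] args) {
        // Topics taken from the sample data in the question.
        List<String> childDoc = Arrays.asList(
                "Cough", "Fever", "Hydration", "Nutrition", "Tens", "Childrens health");
        List<String> adultDoc = Arrays.asList("Cough", "Fever");

        List<String> must = Arrays.asList("fever", "cough");
        List<String> originalMustNot = Arrays.asList("children", "child", "childrens health");
        List<String> splitMustNot = Arrays.asList("children", "child", "childrens", "health");

        System.out.println(passes(childDoc, must, originalMustNot)); // true: not excluded
        System.out.println(passes(childDoc, must, splitMustNot));    // false: excluded
        System.out.println(passes(adultDoc, must, splitMustNot));    // true: still returned
    }
}
```

With the original must_not list the child-related document still passes, because none of children, child or childrens health equals an indexed token. Once the phrase is split, the childrens token matches and the document is filtered out.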