Elasticsearch More Like This query with a filter returns extra results


I have the following definition of type "taggeable":

{
    "mappings": {
        "taggeable": {
            "_all": {"enabled": false},
            "properties": {
                "category": {
                    "type": "string"
                },
                "tags": {
                    "type": "string",
                    "term_vector": "yes"
                }
            }
        }
    }
}

I also have these 5 documents:

Document1 (tags: "t1 t2", category: "cat1")
Document2 (tags: "t1"   , category: "cat1")
Document3 (tags: "t1 t3", category: "cat1")
Document4 (tags: "t4"   , category: "cat1")
Document5 (tags: "t4"   , category: "cat2")

The following query:

{
    "query": {
        "more_like_this": {
            "fields": ["tags"],
            "like": ["t1", "t2"],
            "min_term_freq": 1,
            "min_doc_freq": 1
        }
    }
}

is returning:

Document1 (tags: "t1 t2", category: "cat1")
Document2 (tags: "t1"   , category: "cat1")
Document3 (tags: "t1 t3", category: "cat1")

That is correct, but this query:

{
"query": {
     "filtered": {
     "query": {
         "more_like_this" : {
         "fields" : ["tags"],
         "like" : ["t1", "t2"],
         "min_term_freq" : 1,
         "min_doc_freq": 1
     },
    "filter": {
         "bool": {
                "must": [                            
                    {"match": { "category": "cat1"}}
                ]
         }
    }
 }

} }

is returning:

Document1 (tags: "t1 t2", category: "cat1")
Document4 (tags: "t4"   , category: "cat1")
Document2 (tags: "t1"   , category: "cat1")
Document3 (tags: "t1 t3", category: "cat1")

That is, Document4 is now also retrieved, and its score is similar to that of Document1, which is a perfect match, even though Document4 contains none of the terms in "t1 t2".

Does anyone know what is happening? I'm using Elasticsearch 2.4.6.

Thanks in advance


Solution

  • This is a great example of why consistent indentation is important. Here, I've modified what you've posted with consistent indentation, and the problem is much more apparent (JSONLint is a handy tool, if you aren't using an editor that helps with this):

    {
      "query": {
        "filtered": {
          "query": {
            "more_like_this": {
              "fields": ["tags"],
              "like": ["t1", "t2"],
              "min_term_freq": 1,
              "min_doc_freq": 1
            },
            "filter": {
              "bool": {
                "must": [{
                  "match": {
                    "category": "cat1"
                  }
                }]
              }
            }
          }
        }
      }
    }

    Your "filter" is a child of the inner "query" (a sibling of "more_like_this"), instead of a child of "filtered".
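
    As a sketch of that fix, moving "filter" up one level so that it sits next to the inner "query" inside "filtered" would look roughly like this. The fields and values are copied from your post; I haven't run it against your index, so treat it as an illustration rather than a tested query:

    ```json
    {
      "query": {
        "filtered": {
          "query": {
            "more_like_this": {
              "fields": ["tags"],
              "like": ["t1", "t2"],
              "min_term_freq": 1,
              "min_doc_freq": 1
            }
          },
          "filter": {
            "bool": {
              "must": [{
                "match": {
                  "category": "cat1"
                }
              }]
            }
          }
        }
      }
    }
    ```

    With the filter in that position, it should only narrow the more_like_this results to "cat1", so Document4 (which shares no tags with "t1 t2") should no longer come back.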

    Really, though, you shouldn't use "filtered" at all: it has been deprecated since Elasticsearch 2.0 (see here). You should change it to a "bool" query, as specified there.
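
    A minimal sketch of the "bool" form, assuming Elasticsearch 2.x (where "bool" accepts a "filter" clause that restricts results without affecting the score). Again, the fields and values are taken from your post and this is untested against your data:

    ```json
    {
      "query": {
        "bool": {
          "must": {
            "more_like_this": {
              "fields": ["tags"],
              "like": ["t1", "t2"],
              "min_term_freq": 1,
              "min_doc_freq": 1
            }
          },
          "filter": {
            "match": {
              "category": "cat1"
            }
          }
        }
      }
    }
    ```

    Here the more_like_this clause under "must" does the scoring, and the "filter" clause only limits the candidates to category "cat1".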