How to limit ElasticSearch results by a field value?


We've got a system that indexes resume documents in ElasticSearch using the mapper attachment plugin. Alongside each indexed document, I store some basic info, such as whether it's tied to an applicant or an employee, their name, and the ID they're assigned in the system. A typical query looks something like this when it hits ES:

{
  "size" : 100,
  "query" : {
    "query_string" : {
      "query" : "software AND (developer OR engineer)",
      "default_field" : "fileData"
    }
  },
  "_source" : {
    "includes" : [ "applicant.*", "employee.*" ]
  }
}

And gets me results like:

"hits": [
  {
    "_index": "careers",
    "_type": "resume",
    "_id": "AVEW8FJcqKzY6y-HB4tr",
    "_score": 0.4530588,
    "_source": {
      "applicant": {
        "name": "John Doe",
        "id": 338338
      }
    }
  },
  ...
]

What I'm trying to do is limit the results, so that if John Doe with id 338338 has three different resumes in the system that all match the query, I only get back one match, preferably the highest scoring one (though that's not as important, as long as I can find the person). I've been trying different options with filters and aggregates, but I haven't stumbled across a way to do this.

There are various approaches I can take in the app that calls ES to tackle this after I get results back, but if I can do it on the ES side, that would be preferable. Since I'm limiting the query to, say, 100 results, I'd like to get back 100 individual people, rather than getting back 100 results and then finding out that 25% of them are docs tied to the same person.


Solution

  • What you want is a terms aggregation that returns the top 100 unique values, with a "top_hits" sub-aggregation inside each bucket. Here is an example from my system. In my example I'm:

    1. setting the result size to 0 because I only care about the aggregations
    2. setting the size of the terms aggregation to 100, so I get up to 100 buckets
    3. asking for the top 1 hit in each bucket

    GET index1/type1/_search
    {
      "size": 0,
      "aggs": {
        "a1": {
          "terms": {
            "field": "input.user.name",
            "size": 100
          },
          "aggs": {
            "topHits": {
              "top_hits": {
                "size": 1
              }
            }
          }
        }
      }
    }
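    Applied to the question's mapping, the same idea might look like the sketch below. It only builds the request body and reads a response that has already been parsed into a dict, so it does not depend on any particular client library. Several things here are assumptions rather than something taken from the question or the answer: the helper names, the aggregation names ("by_person", "best_score", "top_resume"), and the choice to group on "applicant.id", which means documents that only carry "employee.id" would not land in any bucket. The optional ordering by best score uses a "max" aggregation with a "_score" script; the exact script syntax, and whether scripting is enabled at all, depends on the Elasticsearch version, so treat that part as something to verify against your cluster.

```python
import json


def build_unique_person_query(query_text, id_field="applicant.id", people=100,
                              order_by_score=True):
    """Build a search body that returns one top-scoring resume per person.

    id_field and the aggregation names are illustrative assumptions.
    """
    terms = {"field": id_field, "size": people}
    sub_aggs = {
        # top_hits sorts by score by default, so size 1 keeps the best resume
        "top_resume": {
            "top_hits": {
                "size": 1,
                "_source": {"includes": ["applicant.*", "employee.*"]},
            }
        }
    }
    if order_by_score:
        # Assumption: script syntax and availability vary by ES version.
        sub_aggs["best_score"] = {"max": {"script": "_score"}}
        # Order buckets by their best score instead of by document count.
        terms["order"] = {"best_score": "desc"}
    return {
        "size": 0,  # skip the flat hit list; only the buckets are needed
        "query": {
            "query_string": {
                "query": query_text,
                "default_field": "fileData",
            }
        },
        "aggs": {"by_person": {"terms": terms, "aggs": sub_aggs}},
    }


def top_hit_per_person(response, agg_name="by_person", hits_name="top_resume"):
    """Pull the single best hit out of each bucket of a parsed response."""
    buckets = response["aggregations"][agg_name]["buckets"]
    results = []
    for bucket in buckets:
        hits = bucket[hits_name]["hits"]["hits"]
        if hits:  # guard against an empty bucket
            results.append(hits[0])
    return results


if __name__ == "__main__":
    body = build_unique_person_query("software AND (developer OR engineer)")
    print(json.dumps(body, indent=2))
```

    The printed body can be sent to the _search endpoint the same way as the original query. Each bucket in the response then corresponds to one person, so a terms size of 100 asks for up to 100 distinct people rather than 100 documents.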