
Elasticsearch: group documents by a field and count occurrences


My Elasticsearch 6.5.2 index looks like this:

  {
    "_index" : "searches",
    "_type" : "searches",
    "_id" : "cCYuHW4BvwH6Y3jL87ul",
    "_score" : 1.0,
    "_source" : {
      "querySearched" : "telecom"
    }
  },
  {
    "_index" : "searches",
    "_type" : "searches",
    "_id" : "cSYuHW4BvwH6Y3jL_Lvt",
    "_score" : 1.0,
    "_source" : {
      "querySearched" : "telecom"
    }
  },
  {
    "_index" : "searches",
    "_type" : "searches",
    "_id" : "eCb6O24BvwH6Y3jLP7tM",
    "_score" : 1.0,
    "_source" : {
      "querySearched" : "industry"
    }
  }

And I would like a query that returns this result:

"result": [
  {
    "querySearched" : "telecom",
    "number" : 2
  },
  {
    "querySearched" : "industry",
    "number" : 1
  }
]

I just want to group by value and get the count of each, limited to the ten highest counts. I tried aggregations, but the buckets come back empty. Thanks!


Solution

  • Suppose your mapping is:

    PUT /index
    {
      "mappings": {
        "doc": {
          "properties": {
            "querySearched": {
              "type": "text",
              "fielddata": true
            }
          }
        }
      }
    }
    

    Your query should look like this:

    GET index/_search
    {
      "size": 0,
      "aggs": {
        "result": {
          "terms": {
            "field": "querySearched",
            "size": 10
          }
        }
      }
    }
    

    You need to set fielddata: true to enable aggregations on a text field; the Elasticsearch fielddata documentation has more on that.

        "size": 10, => limit to 10
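
The terms aggregation reports each group as a bucket with a `key` and a `doc_count` under `aggregations.result.buckets`, so the shape asked for in the question can be produced on the client side. A minimal Python sketch, assuming the response has already been parsed into a dict; `buckets_to_result` and `sample_response` are hypothetical names, and the sample is a hand-written fragment rather than real output:

```python
def buckets_to_result(response, agg_name="result", field="querySearched"):
    """Rename terms-aggregation buckets to the {field, number} shape from the question."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [{field: b["key"], "number": b["doc_count"]} for b in buckets]


# Hand-written fragment (an assumption, not captured output): only the
# per-bucket fields a terms aggregation returns are included here.
sample_response = {
    "aggregations": {
        "result": {
            "buckets": [
                {"key": "telecom", "doc_count": 2},
                {"key": "industry", "doc_count": 1},
            ]
        }
    }
}

result = buckets_to_result(sample_response)
print(result)
```

Buckets come back sorted by `doc_count` in descending order by default, so the list already follows the "biggest numbers first" order the question asks for.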
        
    

    After a short discussion with @Kamal, I feel obligated to point out that enabling fielddata: true can consume a lot of heap space.

    From the link I've shared:

    Fielddata can consume a lot of heap space, especially when loading high cardinality text fields. Once fielddata has been loaded into the heap, it remains there for the lifetime of the segment. Also, loading fielddata is an expensive process which can cause users to experience latency hits. This is why fielddata is disabled by default.

    A more efficient alternative is to add a keyword sub-field:

    PUT /index
    {
      "mappings": {
        "doc": {
          "properties": {
            "querySearched": {
              "type": "text",
              "fields": {
               "keyword": {
                 "type": "keyword",
                 "ignore_above": 256
               }
             }
            }
          }
        }
      }
    }
    

    Then your aggregation query becomes:

    GET index/_search
    {
      "size": 0,
      "aggs": {
        "result": {
          "terms": {
            "field": "querySearched.keyword",
            "size": 10
          }
        }
      }
    }
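
Besides memory use, the two mappings can also bucket differently: with fielddata on a text field the buckets are the analyzed tokens, while the keyword sub-field buckets on the exact string. The Python sketch below imitates this locally to show the effect. It is only an approximation: `text_terms` is a hypothetical stand-in for the standard analyzer (lowercase, split on non-alphanumeric characters), `top_terms` is a hypothetical helper, and the multi-word search "Telecom industry" is invented for illustration.

```python
import re
from collections import Counter


def text_terms(value):
    # Rough stand-in for the standard analyzer (an assumption):
    # lowercase, then split on anything that is not a letter or digit.
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


def top_terms(values, analyzed, size=10):
    """Count documents per bucket key and keep the `size` largest buckets."""
    counts = Counter()
    for value in values:
        if analyzed:
            # A terms aggregation counts documents, so each token is
            # counted at most once per document.
            counts.update(set(text_terms(value)))
        else:
            # Keyword behaviour: the whole string is the bucket key.
            counts[value] += 1
    return counts.most_common(size)


# Hypothetical data: the three searches from the question plus one
# invented multi-word search.
searches = ["telecom", "telecom", "industry", "Telecom industry"]

text_buckets = top_terms(searches, analyzed=True)
keyword_buckets = top_terms(searches, analyzed=False)
print(text_buckets)
print(keyword_buckets)
```

In this sketch the analyzed variant folds "Telecom industry" into the existing "telecom" and "industry" buckets, while the keyword variant keeps it as its own bucket. For single-word, lowercase searches like those in the question, both variants give the same counts.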
    

    Both solutions work, but you should take the heap cost of fielddata into consideration.

    Hope it helps