Search code examples
elasticsearchaggregate-functions

Is it possible to compute "distinct sum" and "distinct average" in elasticsearch?


How can I calculate a "distinct average" in elasticsearch? I have some denormalized data like this:

{ "record_id" : "100", "cost" : 42 }
{ "record_id" : "200", "cost" : 67 }
{ "record_id" : "200", "cost" : 67 }
{ "record_id" : "200", "cost" : 67 }
{ "record_id" : "400", "cost" : 11 }
{ "record_id" : "400", "cost" : 11 }
{ "record_id" : "500", "cost" : 10 }
{ "record_id" : "600", "cost" : 99 }

Notice how the "cost" is always the same for a given "record_id".

So with the above data:

  1. How can I get the AVERAGE values for the "cost" field but DISTINCT by "record_id"? Result would be (42+67+11+10+99)/5=45.8

  2. How can I get the SUM values for the "cost" field but DISTINCT by "record_id"? Result would be 42+67+11+10+99=229

Could I use a combination of a "terms" aggregation and then "first" and "average" sub-aggregations? I'm thinking something like this: elasticsearch calculate average of unique values


Solution

  • It's not going to work with terms aggs. Here's what's possible using painless scripts:

    Indexing -- your actual mapping may differ from the generated default (esp the .keyword part on the rec_id):

    POST _bulk
    {"index":{"_index":"uniques","_type":"_doc"}}
    {"record_id":"100","cost":42}
    {"index":{"_index":"uniques","_type":"_doc"}}
    {"record_id":"200","cost":67}
    {"index":{"_index":"uniques","_type":"_doc"}}
    {"record_id":"200","cost":67}
    {"index":{"_index":"uniques","_type":"_doc"}}
    {"record_id":"200","cost":67}
    {"index":{"_index":"uniques","_type":"_doc"}}
    {"record_id":"400","cost":11}
    {"index":{"_index":"uniques","_type":"_doc"}}
    {"record_id":"400","cost":11}
    {"index":{"_index":"uniques","_type":"_doc"}}
    {"record_id":"500","cost":10}
    {"index":{"_index":"uniques","_type":"_doc"}}
    {"record_id":"600","cost":99}
    

    Then aggregating

    GET uniques/_search
    {
      "size": 0,
      "aggs": {
        "terms": {
          "scripted_metric": {
            "init_script": "state.id_map = [:]; state.sum = 0.0; state.elem_count = 0.0;",
            "map_script": """
              def id = doc['record_id.keyword'].value;
              if (!state.id_map.containsKey(id)) {
                state.id_map[id] = true;
                state.elem_count++;
                state.sum += doc['cost'].value;
              }
            """,
            "combine_script": """
                def sum = state.sum;
                def avg = sum / state.elem_count;
                
                def stats = [:];
                stats.sum = sum;
                stats.avg = avg;
                
                return stats
            """,
            "reduce_script": "return states"
          }
        }
      }
    }
    

    And yielding

    ...
    "aggregations" : {
        "terms" : {
          "value" : [
            {
              "avg" : 45.8,
              "sum" : 229.0
            }
          ]
        }
      }