Tags: elasticsearch, aggregation, elasticsearch-aggregation, elasticsearch-py

Elasticsearch: intersection of two aggregations


I want to find the doc counts common to two aggregations, one on top authors and one on top co-authors. Both are fields inside the biblio data field of the source in an index.

What I am currently doing:

1. Calculating an aggregation on the top 10 authors (A, B, C, D, ...).

2. Calculating an aggregation on the top 10 co-authors (X, Y, Z, ...).

3. Calculating the doc count of the intersection, i.e. the count of docs common to pairs such as:

[(A,X), (B,Y), ...] <----- RESULT

I tried a sub-bucket aggregation, but it gave me: [A: (top 10 corresponding to A), B: (top 10 corresponding to B), ...].


Solution

  • OK, continuing from the comments above as an answer, since that is easier to read and has no character limit.

    Comment

    I don't think you can use a pipeline aggregation to achieve this.

    It's not a lot to process on the client side, I guess: only 20 records (10 for authors and 10 for co-authors), and it would be a simple aggregation query.

    Another option would be to just get the top 10 across both fields, which is also a simple aggregation query.

    But if you really need the intersection of both top 10s on the ES side, go with a Scripted Metric Aggregation. You can put your logic in the script.

    The first option is as simple as:

    GET index_name/_search
    {
      "size": 0, 
      "aggs": {
        "firstname_dupes": {
          "terms": {
            "field": "authorFullName.keyword",
            "size": 10
          }
        },
        "lastname_dupes": {
          "terms": {
            "field": "coauthorFullName.keyword",
            "size": 10
          }
        }
      }
    }
    

    Then you compute the intersection of the two results on the client side.
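
    The client-side step could be sketched in Python roughly as follows. This is only an illustration: `top_bucket_overlap` is a hypothetical helper, and `sample_response` is made-up data shaped like the `aggregations` section a terms aggregation returns for the first query.

```python
def top_bucket_overlap(response, first_agg, second_agg):
    """Return {key: (count_in_first, count_in_second)} for keys present in both aggs."""
    aggs = response["aggregations"]
    first = {b["key"]: b["doc_count"] for b in aggs[first_agg]["buckets"]}
    second = {b["key"]: b["doc_count"] for b in aggs[second_agg]["buckets"]}
    return {key: (first[key], second[key]) for key in first if key in second}


# Hypothetical response body (only the part the helper reads).
sample_response = {
    "aggregations": {
        "firstname_dupes": {
            "buckets": [{"key": "A", "doc_count": 12}, {"key": "B", "doc_count": 9}]
        },
        "lastname_dupes": {
            "buckets": [{"key": "B", "doc_count": 7}, {"key": "X", "doc_count": 5}]
        },
    }
}

overlap = top_bucket_overlap(sample_response, "firstname_dupes", "lastname_dupes")
print(overlap)
```

    With elasticsearch-py, the dict returned by the search call would take the place of `sample_response`.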

    The second would look like:

    GET index_name/_search
    {
      "size": 0, 
      "aggs": {
        "name_dupes": {
          "terms": {
            "script": {
              "source": "return [doc['authorFullName.keyword'].value,doc['coauthorFullName.keyword'].value]"
            },
            "size": 10
          }
        }
      }
    }
    

    But this is not really the intersection of the top 10 authors and the top 10 co-authors: it counts the names across both fields together and then takes the top 10.
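
    To see how the two results can differ, here is a small Python sketch on made-up (author, co-author) pairs. The data, the cutoff `N = 2`, and the helper `top_names` are all assumptions for illustration; nothing here queries Elasticsearch.

```python
from collections import Counter

# Hypothetical documents as (author, coauthor) pairs.
docs = [("A", "B")] * 3 + [("B", "C")] * 2 + [("D", "A"), ("E", "C")]
N = 2  # stands in for the "size": 10 used in the queries


def top_names(counter, n):
    """Names of the n most frequent entries."""
    return {name for name, _ in counter.most_common(n)}


# Intersection of the per-field top N (what the first option computes).
author_counts = Counter(author for author, _ in docs)
coauthor_counts = Counter(coauthor for _, coauthor in docs)
intersection_of_tops = top_names(author_counts, N) & top_names(coauthor_counts, N)

# Top N over both fields counted together (what the second option computes).
combined_counts = Counter()
for author, coauthor in docs:
    combined_counts.update({author, coauthor})
top_combined = top_names(combined_counts, N)

print(intersection_of_tops, top_combined)
```

    Here "A" makes the combined top 2 even though it is not among the top 2 co-authors, so the combined list is larger than the true intersection.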

    The third option is to write a Scripted Metric Aggregation. I didn't have time to work on the algorithmic side of things (it should be optimized), but it might look like the one below. Java skills will certainly help you. Also make sure you understand all the stages of Scripted Metric Aggregation execution and the performance issues you might run into when using it.

    GET index_name/_search
    {
      "size": 0, 
        "query" : {
            "match_all" : {}
        },
        "aggs": {
            "profit": {
                "scripted_metric": {
                    "init_script" : "state.fnames = [:];state.lnames = [:];", 
                    "map_script" :
                    """
                    // doc[...] is never null; check size() to skip docs without a value
                    if (doc['authorFullName.keyword'].size() > 0) {
                      def key = doc['authorFullName.keyword'].value;
                      def value = state.fnames[key];
                      if (value == null) value = 0;
                      state.fnames[key] = value + 1;
                    }
                    if (doc['coauthorFullName.keyword'].size() > 0) {
                      def key = doc['coauthorFullName.keyword'].value;
                      def value = state.lnames[key];
                      if (value == null) value = 0;
                      state.lnames[key] = value + 1;
                    }
                    """,
                    "combine_script" : "return state",
                    "reduce_script" : 
                    """
                    // Sum the per-shard counts first so that each top 10 is global, not per shard.
                    def fnames = [:];
                    def lnames = [:];
                    for (s in states) {
                      for (e in s.fnames.entrySet()) { fnames[e.getKey()] = fnames.getOrDefault(e.getKey(), 0) + e.getValue(); }
                      for (e in s.lnames.entrySet()) { lnames[e.getKey()] = lnames.getOrDefault(e.getKey(), 0) + e.getValue(); }
                    }
                    def f10 = fnames.entrySet().stream().sorted(Collections.reverseOrder(Map.Entry.comparingByValue())).limit(10).map(e->e.getKey()).collect(Collectors.toList());
                    def l10 = lnames.entrySet().stream().sorted(Collections.reverseOrder(Map.Entry.comparingByValue())).limit(10).map(e->e.getKey()).collect(Collectors.toList());

                    // Keep the names that appear in both top-10 lists.
                    def intersection = [];
                    for (name in f10) {
                      if (l10.contains(name)) intersection.add(name);
                    }
                    return intersection;
                    """
                }
            }
        }
    }
    

    Just a note: the queries here assume you have a keyword sub-field on those properties. If not, adjust them to your case.
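
    The stages of a Scripted Metric Aggregation (init and map per shard, combine per shard, one reduce on the coordinating node) can be mimicked in plain Python to reason about the logic before writing Painless. This is a rough simulation under assumptions: the shards, the documents, and the function names `map_shard` and `reduce_states` are invented, and the reduce step here sums the per-shard counts before picking the top N, which is one way to get global top lists.

```python
from collections import Counter


def map_shard(docs):
    """init_script + map_script for one shard: count names per field."""
    state = {"fnames": Counter(), "lnames": Counter()}
    for doc in docs:
        if doc.get("author"):
            state["fnames"][doc["author"]] += 1
        if doc.get("coauthor"):
            state["lnames"][doc["coauthor"]] += 1
    return state  # combine_script: hand the shard state back unchanged


def reduce_states(states, n=10):
    """reduce_script: merge shard counts, then intersect the two global top-n lists."""
    fnames, lnames = Counter(), Counter()
    for state in states:
        fnames.update(state["fnames"])  # adds counts key by key
        lnames.update(state["lnames"])
    top_authors = {name for name, _ in fnames.most_common(n)}
    top_coauthors = {name for name, _ in lnames.most_common(n)}
    return sorted(top_authors & top_coauthors)


# Two hypothetical shards; one document has no co-author.
shard_1 = [
    {"author": "A", "coauthor": "B"},
    {"author": "A", "coauthor": "B"},
    {"author": "B", "coauthor": "C"},
]
shard_2 = [
    {"author": "B", "coauthor": "A"},
    {"author": "A", "coauthor": "C"},
    {"author": "D", "coauthor": "C"},
    {"author": "A"},
]

result = reduce_states([map_shard(shard_1), map_shard(shard_2)], n=2)
print(result)
```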

    UPDATE

    PS: I just noticed you mentioned you need common counts, not common names. I'm not sure what your exact case is, but instead of map(e->e.getKey()) use map(e->e.getValue().toString()). See the other answer on a similar problem.
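
    If what you need is the doc count for specific (author, co-author) pairs such as (A,X) and (B,Y), one possible approach, not tested against your index, is a second request that uses a `filters` aggregation with one named filter per pair. The Python sketch below only builds the request body; `pair_count_body`, the `pair_counts` aggregation name, and the `author|coauthor` bucket naming are hypothetical, and the field names are taken from the queries above.

```python
def pair_count_body(
    pairs,
    author_field="authorFullName.keyword",
    coauthor_field="coauthorFullName.keyword",
):
    """Build a search body that counts docs matching each (author, coauthor) pair."""
    named_filters = {
        f"{author}|{coauthor}": {
            "bool": {
                "filter": [
                    {"term": {author_field: author}},
                    {"term": {coauthor_field: coauthor}},
                ]
            }
        }
        for author, coauthor in pairs
    }
    return {
        "size": 0,
        "aggs": {"pair_counts": {"filters": {"filters": named_filters}}},
    }


# Pairs would come from zipping the two top-10 lists of the first option.
body = pair_count_body([("A", "X"), ("B", "Y")])
print(body)
```

    Each named bucket in the response then carries the `doc_count` of documents that have both the author and the co-author of that pair.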