elasticsearch nest

How to perform a terms aggregation on the URL domain name using NEST ElasticClient


I want to perform an aggregation on a URI field, but return only the domain part of the URL rather than the full URL. For example, for the value https://stackoverflow.com/questions/ask?guided=true I would get stackoverflow.com. Given an existing dataset as follows:

"hits" : [
      {
        "_index" : "people",
        "_type" : "_doc",
        "_id" : "L9WewGoBZqCeOmbRIMlV",
        "_score" : 1.0,
        "_source" : {
          "firstName" : "George",
          "lastName" : "Ouma",
          "pageUri" : "http://www.espnfc.com/story/683732/england-football-team-escaped-terrorist-attack-at-1998-world-cup",
          "date" : "2019-05-16T12:29:08.1308177Z"
        }
      },
      {
        "_index" : "people",
        "_type" : "_doc",
        "_id" : "MNWewGoBZqCeOmbRIsma",
        "_score" : 1.0,
        "_source" : {
          "firstName" : "George",
          "lastName" : "Ouma",
          "pageUri" : "http://www.wikipedia.org/wiki/Category:Terrorism_in_Mexico",
          "date" : "2019-05-16T12:29:08.1308803Z"
        }
      },
      {
        "_index" : "people",
        "_type" : "_doc",
        "_id" : "2V-ewGoBiHg_1GebJKIr",
        "_score" : 1.0,
        "_source" : {
          "firstName" : "George",
          "lastName" : "Ouma",
          "pageUri" : "http://www.wikipedia.com/story/683732/england-football-team-escaped-terrorist-attack-at-1998-world-cup",
          "date" : "2019-05-16T12:29:08.1308811Z"
        }
      }
    ]

My buckets should look as follows:

"buckets" : [
        {
          "key" : "www.espnfc.com",
          "doc_count" : 1
        },
        {
          "key" : "www.wikipedia.com",
          "doc_count" : 2
        }
      ]

The following snippet shows how I currently do the aggregation; however, it aggregates on the full URL rather than the domain name:

var searchResponse = client.Search<Person>(s =>
    s.Size(0)

    .Query(q => q
        .MatchAll()
    )
    .Aggregations(a => a
        .Terms("visited_pages", ta => ta
            .Field(f => f.PageUri.Suffix("keyword"))
        )
    )
);

var aggregations = searchResponse.Aggregations.Terms("visited_pages");

Any assistance will be gratefully appreciated :)


Solution

  • I used the terms aggregation below, with a script.

    Note that I derived the string logic from your sample data, so test it and adjust it to what you are looking for.

    The best approach would be to index a separate field, called hostname for example, that holds the values you are looking for, and to aggregate on that field.

    However, if that is not an option, the aggregation below may help.

    Aggregation Query:

    POST <your_index_name>/_search
    {
      "size": 0,
      "aggs": {
        "my_unique_urls": {
          "terms": {
            "script" : {
              "source": """
                String st = doc['pageUri.keyword'].value;
                if(st==null){
                  return "";
                } else {
                  return st.substring(0, st.lastIndexOf(".")+4);
                }
              """,
              "lang": "painless"
            }
          }
        }
      }
    }
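
    Since the question asks about NEST, here is a rough sketch of how the same scripted terms aggregation might be written with the client. It assumes NEST 6.x or 7.x (where the script descriptor exposes Source), and the Person class shown is only a guess at the POCO from the question, inferred from the _source fields. The aggregation name visited_hosts and the VisitedHosts helper are made up for illustration. The Painless script uses single-quoted strings so it fits inside a C# string literal; the logic is unchanged.

    ```csharp
    using System;
    using Nest;

    // Assumed shape of the document type, inferred from the sample _source.
    public class Person
    {
        public string FirstName { get; set; }
        public string LastName { get; set; }
        public string PageUri { get; set; }
        public DateTime Date { get; set; }
    }

    public static class VisitedHosts
    {
        // Hypothetical helper: runs the scripted terms aggregation and prints each bucket.
        public static void Print(IElasticClient client)
        {
            var searchResponse = client.Search<Person>(s => s
                .Size(0)
                .Aggregations(a => a
                    .Terms("visited_hosts", ta => ta
                        // Same string logic as the REST query: cut the value
                        // three characters after the last dot.
                        .Script(sc => sc
                            .Source("String st = doc['pageUri.keyword'].value; " +
                                    "if (st == null) { return ''; } " +
                                    "return st.substring(0, st.lastIndexOf('.') + 4);")
                            .Lang("painless")
                        )
                    )
                )
            );

            foreach (var bucket in searchResponse.Aggregations.Terms("visited_hosts").Buckets)
            {
                Console.WriteLine($"{bucket.Key}: {bucket.DocCount}");
            }
        }
    }
    ```

    Keep in mind that the keys still include the scheme (http://...), and that the substring logic will cut in the wrong place when the path itself contains a dot, so treat it as a starting point only.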
    

    Below is how my response appears:

    Query Response:

    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 4,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "my_unique_urls": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "http://www.espnfc.com",
              "doc_count": 1
            },
            {
              "key": "http://www.wikipedia.org",
              "doc_count": 1
            },
            {
              "key": "https://en.wikipedia.org",
              "doc_count": 1
            }
          ]
        }
      }
    }
    

    Hope this helps!
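
For the separate hostname field suggested above, one option is to extract the host on the client before indexing. The following is a minimal sketch that relies only on System.Uri from .NET; the HostExtractor class and GetHost method are hypothetical names, not part of NEST or of the question's code.

```csharp
using System;

public static class HostExtractor
{
    // Returns the host part of an absolute URL (for example "www.espnfc.com"),
    // or null when the value is missing or cannot be parsed as an absolute URI.
    public static string GetHost(string pageUri)
    {
        return Uri.TryCreate(pageUri, UriKind.Absolute, out var uri)
            ? uri.Host
            : null;
    }
}
```

If the result is stored in a keyword field on each document (say a Hostname property on Person, which is an assumption about your model), the original terms aggregation should work unchanged with .Field(f => f.Hostname), without any script at query time.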