I want to perform an aggregation on a URI field, but return only the domain part of the URL rather than the full URL. For example, for the value https://stackoverflow.com/questions/ask?guided=true
I would get stackoverflow.com
Given an existing dataset as follows:
"hits" : [
{
"_index" : "people",
"_type" : "_doc",
"_id" : "L9WewGoBZqCeOmbRIMlV",
"_score" : 1.0,
"_source" : {
"firstName" : "George",
"lastName" : "Ouma",
"pageUri" : "http://www.espnfc.com/story/683732/england-football-team-escaped-terrorist-attack-at-1998-world-cup",
"date" : "2019-05-16T12:29:08.1308177Z"
}
},
{
"_index" : "people",
"_type" : "_doc",
"_id" : "MNWewGoBZqCeOmbRIsma",
"_score" : 1.0,
"_source" : {
"firstName" : "George",
"lastName" : "Ouma",
"pageUri" : "http://www.wikipedia.org/wiki/Category:Terrorism_in_Mexico",
"date" : "2019-05-16T12:29:08.1308803Z"
}
},
{
"_index" : "people",
"_type" : "_doc",
"_id" : "2V-ewGoBiHg_1GebJKIr",
"_score" : 1.0,
"_source" : {
"firstName" : "George",
"lastName" : "Ouma",
"pageUri" : "http://www.wikipedia.com/story/683732/england-football-team-escaped-terrorist-attack-at-1998-world-cup",
"date" : "2019-05-16T12:29:08.1308811Z"
}
}
]
My bucket should be as follows:
"buckets" : [
{
"key" : "www.espnfc.com",
"doc_count" : 1
},
{
"key" : "www.wikipedia.com",
"doc_count" : 2
}
]
This is how I currently run the aggregation; however, it buckets on the full URL rather than the domain name:
var searchResponse = client.Search<Person>(s => s
    .Size(0)
    .Query(q => q
        .MatchAll()
    )
    .Aggregations(a => a
        .Terms("visited_pages", ta => ta
            .Field(f => f.PageUri.Suffix("keyword"))
        )
    )
);
var aggregations = searchResponse.Aggregations.Terms("visited_pages");
Any assistance will be gratefully appreciated :)
I used the terms aggregation below with a script.
Note that I derived the string logic from your sample data, so test it and adjust it to match what you need.
The best approach would be to index a separate field, called hostname for example,
containing the value you want, and run the aggregation on that field.
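As a rough sketch of that separate-field approach, an ingest pipeline with a script processor could fill hostname at index time. The pipeline name extract_hostname is hypothetical, and the "source" key assumes Elasticsearch 6.x or later (older versions used "inline"). The triple-quoted string is Kibana Console syntax. The parsing is deliberately naive: it takes everything between "://" and the next "/", so a port, or a query string on a URL with no path, would end up in the value.

```
PUT _ingest/pipeline/extract_hostname
{
  "description": "Sketch: copy the host part of pageUri into hostname",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
          if (ctx.pageUri != null) {
            String url = ctx.pageUri;
            // Start right after the scheme separator, or at 0 if there is none.
            int start = url.indexOf("://");
            start = start < 0 ? 0 : start + 3;
            // The host ends at the first "/" after the scheme, if any.
            int end = url.indexOf("/", start);
            ctx.hostname = end < 0 ? url.substring(start) : url.substring(start, end);
          }
        """
      }
    }
  ]
}
```

Documents indexed with ?pipeline=extract_hostname would then carry the new field, and a plain terms aggregation on it (hostname.keyword under default dynamic mapping) avoids running a script at query time.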
However, if you cannot add such a field, the aggregation below should help:
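POST <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "my_unique_urls": {
      "terms": {
        "script": {
          "inline": """
            String st = doc['pageUri.keyword'].value;
            if (st == null) {
              return "";
            } else {
              // Keep everything up to the last "." plus three more characters.
              // This assumes a three-letter TLD (.com, .org) and no "." later in the URL.
              return st.substring(0, st.lastIndexOf(".") + 4);
            }
          """,
          "lang": "painless"
        }
      }
    }
  }
}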
Below is how my response appears:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0,
"hits": []
},
"aggregations": {
"my_unique_urls": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "http://www.espnfc.com",
"doc_count": 1
},
{
"key": "http://www.wikipedia.org",
"doc_count": 1
},
{
"key": "https://en.wikipedia.org",
"doc_count": 1
}
]
}
}
}
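The keys above still include the scheme, while the question asks for the host only. A possible variant of the script, assuming every value has the form scheme://host/path, cuts at "://" and at the first "/" after it instead of relying on the TLD length. The aggregation name visited_hosts is hypothetical, and "source" assumes Elasticsearch 6.x or later (use "inline" on older versions).

```
POST <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "visited_hosts": {
      "terms": {
        "script": {
          "lang": "painless",
          "source": """
            // Documents without a pageUri value fall into an empty-string bucket.
            if (doc['pageUri.keyword'].size() == 0) {
              return "";
            }
            String url = doc['pageUri.keyword'].value;
            // Start right after the scheme separator, or at 0 if there is none.
            int start = url.indexOf("://");
            start = start < 0 ? 0 : start + 3;
            // The host ends at the first "/" after the scheme, if any.
            int end = url.indexOf("/", start);
            return end < 0 ? url.substring(start) : url.substring(start, end);
          """
        }
      }
    }
  }
}
```

For the three documents in the question this should produce keys such as www.espnfc.com, without the scheme or path. A port number or user info in the URL would remain part of the key, so adjust the logic if your data contains those.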
Hope this helps!