We're running reports over our applications' log files, which are collected in Elasticsearch. Among others, the logs have the following fields:
@timestamp: Oct 24, 2023 @ 20:15:52.043
Action: INSTALL
Components: a; b
UserName: U12345
Ultimately, I need to run a query that groups by component and returns the count of installs and the number of unique users who performed them.
I'm able to group by the raw Components field in a query, getting a result like:
{
  "key" : "a;b",
  "doc_count" : 10,
  "unique_user_ids" : {
    "value" : 2
  }
},
{
  "key" : "a;c",
  "doc_count" : 8,
  "unique_user_ids" : {
    "value" : 1
  }
},
However, I can't figure out how to count each record once for every component it lists, to get a result like:
{
  "key" : "a",
  "doc_count" : 18,
  "unique_user_ids" : {
    "value" : 3
  }
},
{
  "key" : "b",
  "doc_count" : 10,
  "unique_user_ids" : {
    "value" : 2
  }
},
{
  "key" : "c",
  "doc_count" : 8,
  "unique_user_ids" : {
    "value" : 1
  }
},
If you have control over the data you're ingesting, the best way to do this would be to split the key prior to indexing it. In other words, your records should become:
{
  "key" : ["a", "b"],
  "user_name" : "foo"
},
{
  "key" : ["a", "c"],
  "user_name" : "bar"
},
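If reindexing is an option, the split can even happen inside Elasticsearch itself via an ingest pipeline with the split processor. Here is a minimal sketch against the simplified documents above (the pipeline name split-components is illustrative); if your real Components values have a space after the semicolon, as in "a; b", a separator of ";\\s*" would absorb it:

// Define a pipeline that splits "key" on ';' into an array
PUT _ingest/pipeline/split-components
{
  "processors": [
    {
      "split": {
        "field": "key",
        "separator": ";"
      }
    }
  ]
}

// Index through the pipeline: "a;b" is stored as ["a", "b"]
PUT test/_doc/1?pipeline=split-components
{
  "key" : "a;b",
  "user_name" : "foo"
}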
A much slower solution that doesn't require reindexing your data is to use a runtime field to split the components. The runtime field can be defined either in the query or in the index mapping. Here is an example of how to do it in the query:
DELETE test

PUT test
{
  "mappings": {
    "properties": {
      "key": {
        "type": "keyword"
      },
      "user_name": {
        "type": "keyword"
      }
    }
  }
}

PUT test/_doc/1
{
  "key" : "a;b",
  "user_name" : "foo"
}

PUT test/_doc/2?refresh
{
  "key" : "a;c",
  "user_name" : "bar"
}

GET test/_search
{
  "size": 0,
  "runtime_mappings": {
    "keys": {
      "type": "keyword",
      "script": """
        // Split the combined key on ';' and emit each component as its own value
        def combined_key = /;/.split(doc['key'].value);
        for (key in combined_key) {
          emit(key);
        }
      """
    }
  },
  "aggs": {
    "components": {
      "terms": {
        "field": "keys"
      }
    }
  }
}
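To also get the unique-user count per component that the question asks for, a cardinality sub-aggregation on user_name can be nested inside the terms aggregation. A sketch against the same test index (the aggregation names are illustrative):

GET test/_search
{
  "size": 0,
  "runtime_mappings": {
    "keys": {
      "type": "keyword",
      "script": """
        // Same split-and-emit script as above
        def combined_key = /;/.split(doc['key'].value);
        for (key in combined_key) {
          emit(key);
        }
      """
    }
  },
  "aggs": {
    "components": {
      "terms": {
        "field": "keys"
      },
      "aggs": {
        "unique_user_ids": {
          "cardinality": {
            "field": "user_name"
          }
        }
      }
    }
  }
}

Note that cardinality is an approximate count of distinct values, which matches the shape of the unique_user_ids blocks in your desired output. And if you don't want to repeat the script in every query, the same runtime field can instead be declared once in the index mapping (available since Elasticsearch 7.11):

PUT test/_mapping
{
  "runtime": {
    "keys": {
      "type": "keyword",
      "script": {
        "source": """
          def combined_key = /;/.split(doc['key'].value);
          for (key in combined_key) {
            emit(key);
          }
        """
      }
    }
  }
}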