Tags: elasticsearch, elasticsearch-aggregation

How to get duplicate field values and their count in Elasticsearch


I have a school project in which I use the ELK stack.

I have a lot of data, and I want to know which log lines are duplicates and how many duplicates there are of each particular log line, based on log level, server and time range.

I tried the following query, with which I successfully extracted the duplicate counts:

GET /_all/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "beat.hostname": "server-x"
          }
        },
        {
          "match": {
            "log_level": "WARNING"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-48h",
              "lte": "now"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "duplicateNames": {
      "terms": {
        "field": "message_description.keyword",
        "min_doc_count": 2,
        "size": 10000
      }
    }
  }
}

It successfully gives me the output:

"aggregations" : {
"duplicateNames" : {
  "doc_count_error_upper_bound" : 0,
  "sum_other_doc_count" : 0,
  "buckets" : [
    {
      "key" : "AuthToken not found [ ]",
      "doc_count" : 657
    }
  ]
}

When I try the very same query and only change the log_level from WARNING to CRITICAL, it gives me 0 buckets. This is strange, because I can see in Kibana that there are duplicate message_description field values. Does this have something to do with the .keyword sub-field, or maybe with the length of the message_description?

I hope someone can help me with this weird problem.

Edit: here are two documents that have exactly the same message_description. Why can't I get them in the results?

 {
        "_index" : "filebeat-2019.09.17",
        "_type" : "_doc",
        "_id" : "yYzDP20BiDGBoVteKHjZ",
        "_score" : 10.144365,
        "_source" : {
          "beat" : {
            "name" : "graylog",
            "hostname" : "server-x",
            "version" : "6.8.2"
          },
          "message" : """[2019-09-17 17:06:57] request.CRITICAL: Uncaught PHP Exception ErrorException: "Warning: include(/data/httpd/xxx/xxx/var/cache/dev/overblog/graphql-bundle/__definitions__/QueryType.php): failed to open stream: No such file or directory" at /data/httpd/xxx/xxx/vendor/composer/ClassLoader.php line 444 {"exception":"[object] (ErrorException(code: 0): Warning: include(/data/httpd/xxx/xxx/var/cache/dev/overblog/graphql-bundle/__definitions__/QueryType.php): failed to open stream: No such file or directory at /data/httpd/xxx/xxx/vendor/composer/ClassLoader.php:444)"} []""",
          "@version" : "1",
          "source" : "/data/httpd/xxx/xxx/var/log/dev.log",
          "tags" : [
            "beats_input_codec_plain_applied",
            "_grokparsefailure",
            "_dateparsefailure"
          ],
          "timestamp" : "2019-09-17 17:06:57",
          "input" : {
            "type" : "log"
          },
          "offset" : 54819,
          "prospector" : {
            "type" : "log"
          },
          "application" : "request",
          "log_level" : "CRITICAL",
          "stack_trace" : """{"exception":"[object] (ErrorException(code: 0): Warning: include(/data/httpd/xxx/xxx/var/cache/dev/overblog/graphql-bundle/__definitions__/QueryType.php): failed to open stream: No such file or directory at /data/httpd/xxx/xxx/vendor/composer/ClassLoader.php:444)"} []""",
          "message_description" : """Uncaught PHP Exception ErrorException: "Warning: include(/data/httpd/xxx/xxx/var/cache/dev/overblog/graphql-bundle/__definitions__/QueryType.php): failed to open stream: No such file or directory" at /data/httpd/xxx/xxx/vendor/composer/ClassLoader.php line 444""",
          "@timestamp" : "2019-09-17T15:06:57.436Z",
          "host" : {
            "name" : "graylog"
          },
          "log" : {
            "file" : {
              "path" : "/data/httpd/xxx/xxx/var/log/dev.log"
            }
          }
        }
      },
      {
        "_index" : "filebeat-2019.09.17",
        "_type" : "_doc",
        "_id" : "CYzDP20BiDGBoVteKHna",
        "_score" : 10.144365,
        "_source" : {
          "beat" : {
            "name" : "graylog",
            "hostname" : "server-x",
            "version" : "6.8.2"
          },
          "message" : """[2019-09-17 17:06:56] request.CRITICAL: Uncaught PHP Exception ErrorException: "Warning: include(/data/httpd/xxx/xxx/var/cache/dev/overblog/graphql-bundle/__definitions__/QueryType.php): failed to open stream: No such file or directory" at /data/httpd/xxx/xxx/vendor/composer/ClassLoader.php line 444 {"exception":"[object] (ErrorException(code: 0): Warning: include(/data/httpd/xxx/xxx/var/cache/dev/overblog/graphql-bundle/__definitions__/QueryType.php): failed to open stream: No such file or directory at /data/httpd/xxx/xxx/vendor/composer/ClassLoader.php:444)"} []""",
          "@version" : "1",
          "source" : "/data/httpd/xxx/xxx/var/log/dev.log",
          "tags" : [
            "beats_input_codec_plain_applied",
            "_grokparsefailure",
            "_dateparsefailure"
          ],
          "timestamp" : "2019-09-17 17:06:56",
          "input" : {
            "type" : "log"
          },
          "offset" : 45716,
          "prospector" : {
            "type" : "log"
          },
          "application" : "request",
          "log_level" : "CRITICAL",
          "stack_trace" : """{"exception":"[object] (ErrorException(code: 0): Warning: include(/data/httpd/xxx/xxx/var/cache/dev/overblog/graphql-bundle/__definitions__/QueryType.php): failed to open stream: No such file or directory at /data/httpd/xxx/xxx/vendor/composer/ClassLoader.php:444)"} []""",
          "message_description" : """Uncaught PHP Exception ErrorException: "Warning: include(/data/httpd/xxx/xxx/var/cache/dev/overblog/graphql-bundle/__definitions__/QueryType.php): failed to open stream: No such file or directory" at /data/httpd/xxx/xxx/vendor/composer/ClassLoader.php line 444""",
          "@timestamp" : "2019-09-17T15:06:57.426Z",
          "host" : {
            "name" : "graylog"
          },
          "log" : {
            "file" : {
              "path" : "/data/httpd/xxx/xxx/var/log/dev.log"
            }
          }
        }
      }

Solution

  • What happens is that the message_description value is longer than 256 characters and thus gets ignored by the keyword sub-field: when a string field is mapped dynamically, its keyword sub-field is created with ignore_above: 256, and longer values are simply not indexed into it. Run GET filebeat-2019.09.17 and look at the mapping to confirm this.
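
    With dynamic mapping, the field should look something like this in the mapping response (an illustrative excerpt; 256 is the default ignore_above that dynamic mapping puts on keyword sub-fields):

    "message_description" : {
      "type" : "text",
      "fields" : {
        "keyword" : {
          "type" : "keyword",
          "ignore_above" : 256
        }
      }
    }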

    What you can do is raise that limit by updating the field's mapping like this:

    PUT filebeat-*/_mapping/_doc
    {
      "properties": {
        "message_description": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 500
            }
          }
        }
      }
    }
    

    Then run an update-by-query to reindex the existing documents in place, so that the new mapping is applied to the data already present in those indexes:

    POST filebeat-*/_update_by_query
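
    Note that if documents are still being written while this runs, version conflicts will abort the operation by default. A variant that skips conflicting documents instead (conflicts=proceed is a standard _update_by_query parameter):

    POST filebeat-*/_update_by_query?conflicts=proceed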
    

    Once that's done, your query will magically work again ;-)
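
    As a sanity check, you can then re-run the aggregation from the question with the log level switched to CRITICAL (the same query, nothing changed but the value), and the duplicate buckets should now show up:

    GET /_all/_search
    {
      "query": {
        "bool": {
          "must": [
            { "match": { "beat.hostname": "server-x" } },
            { "match": { "log_level": "CRITICAL" } },
            { "range": { "@timestamp": { "gte": "now-48h", "lte": "now" } } }
          ]
        }
      },
      "aggs": {
        "duplicateNames": {
          "terms": {
            "field": "message_description.keyword",
            "min_doc_count": 2,
            "size": 10000
          }
        }
      }
    }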