
Splunk: Record deduplication using a unique field


We are considering moving our log analytics solution from ElasticSearch/Kibana to Splunk.

We currently use the "document id" in ElasticSearch to deduplicate records when indexing:

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html

We generate the id using a hash of the content of each log record.
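Schematically, the id generation looks something like this (a minimal Python sketch; the canonical-JSON serialization and the field names in the sample record are illustrative):

```python
import hashlib
import json

def document_id(record: dict) -> str:
    """Derive a deterministic document id from the full content of a log record."""
    # Serialize with sorted keys so the same record always hashes identically,
    # regardless of key order in the incoming dict.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Identical records produce identical ids, so re-indexing the same record
# in ElasticSearch overwrites instead of duplicating.
r = {"host": "web-01", "message": "login failed", "time": "2021-01-01T00:00:00Z"}
print(document_id(r) == document_id(dict(r)))  # → True
```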

In Splunk, I found the internal field "_cd", which is unique to each record in a Splunk index: https://docs.splunk.com/Documentation/Splunk/8.1.0/Knowledge/Usedefaultfields

However, when using the HTTP Event Collector to ingest records, I couldn't find any way to set this "_cd" field in the request: https://docs.splunk.com/Documentation/Splunk/8.1.0/Data/HECExamples

Any tips on how to achieve this in Splunk?


Solution

  • What are you trying to achieve?

    If you're sending "unique" events to the HEC, or you're running UFs on "unique" logs, you'll never get duplicate "records when indexing".

    It sounds like you (perhaps routinely?) resend the same data to your aggregation platform - which is not a problem with the aggregator, but with your sending process.

    Almost like you're doing a MySQL/PostgreSQL "insert if not exists" operation. If that is a correct understanding of your situation, based on your statement

    We currently use "document id" in ElasticSearch to deduplicate records when indexing:
    https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
    We generate the id using a hash of the content of each log record.

    then you need to evaluate what is going "wrong" in your sending process that you feel you need to pre-clean the data before ingesting it.

    It is true that Splunk won't "deduplicate records when indexing" - because it presumes the data coming in to be 'correct' from whatever is submitting it.

    How are you getting duplicate data in the first place?

    Fields in Splunk which begin with an underscore (e.g. _time, _cd, etc.) are not editable/sendable - they're generated by Splunk when it receives data. In other words, they're all internal fields. Searchable. Usable. But not overrideable.
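    What you can do is ship your content hash as an ordinary indexed field in the HEC payload, so it is at least searchable later. A sketch of building such a payload (the field name event_hash and the sourcetype are made up for illustration; there is simply no key for "_cd" in the HEC event format):

    ```python
    import hashlib
    import json

    def build_hec_payload(record: dict, sourcetype: str = "myapp:log") -> dict:
        """Build a JSON payload for HEC's /services/collector/event endpoint.

        "_cd" cannot be supplied - Splunk assigns it at index time - so we
        attach our content hash as a regular indexed field instead
        ("event_hash" is an illustrative name, not a Splunk convention).
        """
        canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
        event_hash = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        return {
            "event": record,
            "sourcetype": sourcetype,
            "fields": {"event_hash": event_hash},
        }

    payload = build_hec_payload({"message": "login failed", "host": "web-01"})
    # POST json.dumps(payload) to https://<splunk-host>:8088/services/collector/event
    # with the header "Authorization: Splunk <hec-token>".
    ```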

    If you really have a problem with [lots of/too much] duplicate data, and there is no way to fix your sending process[es], then you'll need to rely on deduplication operations in SPL when searching for/reporting on whatever you've ingested (primarily by using stats and, when absolutely necessary/unavoidable, dedup).
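    For example, assuming you shipped your content hash as a field named event_hash (an illustrative name - substitute whatever field you actually send), a search-time dedup could look something like:

    ```
    index=myapp sourcetype=myapp:log
    | dedup event_hash
    | table _time host message
    ```

    or, since stats generally scales better than dedup on large result sets:

    ```
    index=myapp sourcetype=myapp:log
    | stats first(_raw) as _raw first(_time) as _time by event_hash
    ```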