Search code examples
elasticsearchkibana

Consider comma separated values in a field as separate values while aggregating in Elasticsearch


i want to do aggregation on a field which has values like

doc1_field: "A" doc2_field: "A, B" doc3_field: "A, B, C"

What mappings / settings I can use so that when I aggregate on this field I should get results like:

key: A count: 3

key: "B" count: 2

key: "C" count: 3

I have tried adding a comma separated analyzer on the above field but it's not giving me the desired results, I still get results like

key: "A" count: 1

key: "A, B, C" count: 1

which is not what I want..

Could you please help in how can I aggregate on this field??


Solution

  • The best way is to split the field at ingest time using an ingest pipeline and a split processor:

    Simulating the following ingest pipeline with the same documents you provided would yield this:

    POST _ingest/pipeline/_simulate
    {
      "pipeline": {
        "processors": [
          {
            "split": {
              "field": "field_name",
              "separator": "\\s*[,;]\\s*"
            }
          }
        ]
      },
      "docs": [
        {
          "_source": {
            "field_name": "A"
          }
        },
        {
          "_source": {
            "field_name": "A, B"
          }
        },
        {
          "_source": {
            "field_name": "A; B; C"
          }
        }
      ]
    }
    

    Results:

    {
      "docs": [
        {
          "doc": {
            "_source": {
              "field_name": [
                "A"
              ]
            }
          }
        },
        {
          "doc": {
            "_source": {
              "field_name": [
                "A",
                "B"
              ]
            }
          }
        },
        {
          "doc": {
            "_source": {
              "field_name": [
                "A",
                "B",
                "C"
              ]
            }
          }
        }
      ]
    }
    

    This means that instead of indexing "A, B, C" you'd index an array containing three separate values: ["A", "B", "C"] which you can then easily aggregate on and get the results you want.

    If you want to go this way, just create the pipeline:

    PUT _ingest/pipeline/splitter
    {
        "processors": [
          {
            "split": {
              "field": "field_name",
              "separator": "\\s*,\\s*"
            }
          }
        ]
    }
    

    And then index your documents by referencing it:

    PUT index/_doc/1?pipeline=splitter
    {
       "field": "A, B, C"
    }