Consider comma separated values in a field as separate values while aggregating in Elasticsearch

i want to do aggregation on a field which has values like

doc1_field: "A" doc2_field: "A, B" doc3_field: "A, B, C"

What mappings / settings I can use so that when I aggregate on this field I should get results like:

key: A count: 3

key: "B" count: 2

key: "C" count: 3

I have tried adding a comma separated analyzer on the above field but it's not giving me the desired results, I still get results like

key: "A" count: 1

key: "A, B, C" count: 1

which is not what I want..

Could you please help in how can I aggregate on this field??

Solution

The best way is to split the field at ingest time using an ingest pipeline and a split processor:

Simulating the following ingest pipeline with the same documents you provided would yield this:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "split": {
          "field": "field_name",
          "separator": "\\s*[,;]\\s*"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "field_name": "A"
      }
    },
    {
      "_source": {
        "field_name": "A, B"
      }
    },
    {
      "_source": {
        "field_name": "A; B; C"
      }
    }
  ]
}

Results:

{
  "docs": [
    {
      "doc": {
        "_source": {
          "field_name": [
            "A"
          ]
        }
      }
    },
    {
      "doc": {
        "_source": {
          "field_name": [
            "A",
            "B"
          ]
        }
      }
    },
    {
      "doc": {
        "_source": {
          "field_name": [
            "A",
            "B",
            "C"
          ]
        }
      }
    }
  ]
}

This means that instead of indexing "A, B, C" you'd index an array containing three separate values: ["A", "B", "C"] which you can then easily aggregate on and get the results you want.

If you want to go this way, just create the pipeline:

PUT _ingest/pipeline/splitter
{
    "processors": [
      {
        "split": {
          "field": "field_name",
          "separator": "\\s*,\\s*"
        }
      }
    ]
}

And then index your documents by referencing it:

PUT index/_doc/1?pipeline=splitter
{
   "field": "A, B, C"
}