elasticsearch filter aggregation-framework bucket

Elasticsearch filter/aggregation by next/previous array item

Let's say there are these three documents, and I need to write an Elasticsearch query that takes an item name as a parameter and returns the items that come right after it (determined by order), together with how often each one occurs.

itemArray is defined as a nested object, but it does not have to be nested. I'm a bit lost in the documentation. Any help will be appreciated.

Data Example:

doc-1

{
  "id" : 0,
  "itemArray": [
     {
        "name":"X",
        "order" : 0
     },
     {
        "name":"Y",
        "order" : 1
     },
     {
        "name":"Z",
        "order" : 2
     }
  ]
}

doc-2

{
  "id" : 1,
  "itemArray": [
     {
        "name":"X",
        "order" : 0
     },
     {
        "name":"Y",
        "order" : 1
     },
     {
        "name":"T",
        "order" : 2
     }
  ]
}

doc-3

{
  "id" : 2,
  "itemArray": [
     {
        "name":"X",
        "order" : 0
     },
     {
        "name":"Y",
        "order" : 1
     },
     {
        "name":"Z",
        "order" : 2
     }
  ]
}

Response example for the input "X": all three documents contain Y right after X in the array, according to order:

{
    "Y": 3
}

Response example for the input "Y": two documents contain Z and one document contains T right after Y in the array, according to order:

{
    "Z": 2,
    "T": 1
}

Elasticsearch version: 6.2

Solution

This is quite feasible if you are willing to denormalize your data a little.

How can "next element in array" aggregation be implemented?

Suppose your mapping looks like this:

PUT nextval
{
  "mappings": {
    "item": {
      "properties": {
        "id": {
          "type": "long"
        },
        "itemArray": {
          "type": "nested",
          "properties": {
            "name": {
              "type": "keyword"
            },
            "nextName": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

Here each nested object explicitly stores the name of the next element of the array. Now let's insert the data:

POST nextval/item/0
{
  "id" : 0,
  "itemArray": [
     {
        "name":"X",
        "nextName":"Y"
     },
     {
        "name":"Y",
        "nextName":"Z"
     },
     {
        "name":"Z"
     }
  ]
}

POST nextval/item/1
{
  "id" : 1,
  "itemArray": [
     {
        "name":"X",
        "nextName":"Y"
     },
     {
        "name":"Y",
        "nextName":"T"
     },
     {
        "name":"T"
     }
  ]
}

POST nextval/item/2
{
  "id" : 2,
  "itemArray": [
     {
        "name":"X",
        "nextName":"Y"
     },
     {
        "name":"Y",
        "nextName":"Z"
     },
     {
        "name":"Z"
     }
  ]
}
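The documents above were denormalized by hand. If your source data still has the original order field, the nextName values could be derived before indexing. Here is a minimal sketch in Python, assuming the documents are available as plain dicts; the helper name add_next_names is hypothetical and not part of any Elasticsearch client:

```python
def add_next_names(doc):
    """Return a new document whose items carry nextName instead of order.

    Items are sorted by their "order" value first, so the input array
    does not have to be sorted. The last item gets no nextName field.
    """
    items = sorted(doc["itemArray"], key=lambda item: item["order"])
    denormalized = []
    for index, item in enumerate(items):
        entry = {"name": item["name"]}
        if index + 1 < len(items):
            entry["nextName"] = items[index + 1]["name"]
        denormalized.append(entry)
    return {"id": doc["id"], "itemArray": denormalized}


# The second document from the question, deliberately shuffled.
doc_1 = {
    "id": 1,
    "itemArray": [
        {"name": "T", "order": 2},
        {"name": "X", "order": 0},
        {"name": "Y", "order": 1},
    ],
}

print(add_next_names(doc_1))
```

The result has the same shape as the nextval/item/1 document above. Dropping order from the output is a choice made here to match that shape; you may prefer to keep it.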

And use a query like this to obtain the result for the input X:

POST nextval/item/_search
{
  "query": {
    "nested": {
      "path": "itemArray",
      "query": {
        "term": {
          "itemArray.name": "X"
        }
      }
    }
  },
  "aggs": {
    "1. setup nested": {
      "nested": {
        "path": "itemArray"
      },
      "aggs": {
        "2. filter agg results": {
          "filter": {
            "term": {
              "itemArray.name": "X"
            }
          },
          "aggs": {
            "3. aggregate by nextName": {
              "terms": {
                "field": "itemArray.nextName"
              }
            }
          }
        }
      }
    }
  }
}
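Since the item name appears twice in the request (once in the query, once in the filter aggregation), it may be convenient to generate the body from a single parameter. The following is only a sketch: build_next_item_query is a hypothetical helper that produces the same JSON as above as a Python dict, which you could then send with whatever HTTP or Elasticsearch client you use.

```python
import json


def build_next_item_query(name):
    """Build the search body shown above for an arbitrary item name."""

    def name_term():
        # A fresh dict per use, so the two places do not share one object.
        return {"term": {"itemArray.name": name}}

    return {
        "query": {"nested": {"path": "itemArray", "query": name_term()}},
        "aggs": {
            "1. setup nested": {
                "nested": {"path": "itemArray"},
                "aggs": {
                    "2. filter agg results": {
                        "filter": name_term(),
                        "aggs": {
                            "3. aggregate by nextName": {
                                "terms": {"field": "itemArray.nextName"}
                            }
                        },
                    }
                },
            }
        },
    }


# Print the body for the input Y as JSON, ready to paste into a request.
print(json.dumps(build_next_item_query("Y"), indent=2))
```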

The output will look like this:

{
  ...,
  "aggregations": {
    "1. setup nested": {
      "doc_count": 9,
      "2. filter agg results": {
        "doc_count": 3,
        "3. aggregate by nextName": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "Y",
              "doc_count": 3
            }
          ]
        }
      }
    }
  }
}

If we run the same query for the input Y, the output will be:

{
  ...,
  "aggregations": {
    "1. setup nested": {
      "doc_count": 9,
      "2. filter agg results": {
        "doc_count": 3,
        "3. aggregate by nextName": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "Z",
              "doc_count": 2
            },
            {
              "key": "T",
              "doc_count": 1
            }
          ]
        }
      }
    }
  }
}

How does it work?

One important thing to know about nested objects is:

each nested object is indexed as a hidden separate document

I recommend reading this page of the Guide; it provides a great explanation and examples.

Since these objects are separate, we lose the information about their position in the array. This is the reason you put order there in the first place.

That's why we put the nextName field in the nested object: so that each object itself knows which element follows it.

OK, but why is the aggregation so complex?

Let's recap. In our query there are basically 4 essential points:

1) query by itemArray.name==X
2) 1st-level aggregation, nested
3) 2nd-level aggregation, filter
4) 3rd-level aggregation, terms

Part 1) is pretty obvious: we only want documents that match our request. Part 2) is also straightforward: since itemArray is of type nested, we can only aggregate on its fields within a nested context.

Part 3) is the tricky one. Let's return to the output of the query for the input Y:

{
  ...,
  "aggregations": {
    "1. setup nested": {
      "doc_count": 9,
      "2. filter agg results": {
        "doc_count": 3,
        "3. aggregate by nextName": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "Z",
              "doc_count": 2
            },
            {
              "key": "T",
              "doc_count": 1
            }
          ]
        }
      }
    }
  }
}

The doc_count of the first aggregation is 9. Why 9? Because this is the number of nested objects in the documents that matched our search query: 3 documents with 3 items each.

This is why we need aggregation 3): from all these nested items, it selects only those whose itemArray.name equals the requested name (3 of the 9).

And part 4) is simple again: it just counts how many times each value of the itemArray.nextName field occurs.
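To make the four steps concrete, here is a rough in-memory imitation in Python. It is not how Elasticsearch executes the request; it only mirrors the logic of each step on the three example documents, and the function name simulate is made up for this illustration.

```python
from collections import Counter

DOCS = [
    {"id": 0, "itemArray": [{"name": "X", "nextName": "Y"},
                            {"name": "Y", "nextName": "Z"},
                            {"name": "Z"}]},
    {"id": 1, "itemArray": [{"name": "X", "nextName": "Y"},
                            {"name": "Y", "nextName": "T"},
                            {"name": "T"}]},
    {"id": 2, "itemArray": [{"name": "X", "nextName": "Y"},
                            {"name": "Y", "nextName": "Z"},
                            {"name": "Z"}]},
]


def simulate(docs, name):
    # 1) query: keep documents that have at least one item with this name
    matched = [d for d in docs
               if any(item["name"] == name for item in d["itemArray"])]
    # 2) nested aggregation: every item of every matched document
    nested = [item for d in matched for item in d["itemArray"]]
    # 3) filter aggregation: only the items with the requested name
    filtered = [item for item in nested if item["name"] == name]
    # 4) terms aggregation: count nextName values, skipping items without one
    buckets = Counter(item["nextName"] for item in filtered
                      if "nextName" in item)
    return len(nested), len(filtered), dict(buckets)


print(simulate(DOCS, "Y"))  # (9, 3, {'Z': 2, 'T': 1})
print(simulate(DOCS, "X"))  # (9, 3, {'Y': 3})
```

The first two numbers correspond to the doc_count values of the nested and filter aggregations in the output above. Without step 3, the counts would also include the nextName values of the other items in the matched documents.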

Are there better ways?

Probably, yes. It depends on your data, on your needs, and on how free you are to change the mapping. For instance, if you are just exploring your data, scripted aggregations offer a lot of flexibility.

Hope that helps!