Search code examples
elasticsearchkibanavega-lite

In Kibana's Vega, how can I create layers from two different aggs in one request


In Elasticsearch's HTTP API, you can have a bucketing aggregation and a metric aggregation in a single request to the _search API. In Kibana's Vega environment, how can you create a Vega visualization which uses a single _search request with a buckets aggregation and a metric aggregation; and then makes a chart with one layer using data from the buckets and one layer using data from the metric?


To make this question more concrete, consider this example:

Imagine we are hat makers. Multiple stores carry our hats. We have an Elasticsearch index hat-sales which has one document for each time one of our hats is sold. Included in this document is the store at which the hat was sold.

Here are two examples of the documents in this index:

{
  "type": "top",
  "color": "black",
  "price": 19,
  "store": "Macy's"
}
{
  "type": "fez",
  "color": "red",
  "price": 94,
  "store": "Walmart"
}

I want to create a bar chart which shows the number of hats sold in the top 3 stores. I also want a horizontal rule on this chart which shows the average number of hats sold over all stores - not just the top 3. Here is a sketch of what I want the chart to look like:

enter image description here

If we did this, having Vega do the calculation of the average:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "title": "Hat Sales",
  "data": {
    "url": {
      "index": "hat-sales",
      "body": {
        "size": 0,
        "query": {"match_all": {}},
        "aggs": {"stores": {"terms": {"field": "store.keyword", "size": 3}}}
      }
    },
    "format": {"property": "aggregations.stores.buckets"}
  },
  "transform": [
    {"calculate": "datum.key", "as": "store"},
    {"calculate": "datum.doc_count", "as": "count"}
  ],
  "layer": [
    {
      "name": "Sales of top 3 stores",
      "mark": "bar",
      "encoding": {
        "x": {"type": "nominal", "field": "store", "sort": "-y"},
        "y": {"type": "quantitative", "field": "count"}
      }
    },
    {
      "name": "Average number of sales over all stores",
      "mark": {"type": "rule", "color": "red"},
      "encoding": {"y": {"aggregate": "mean", "field": "count"}}
    }
  ]
}

which would looks like this: enter image description here then the horizontal rule would be an average of the top 3 stores only. Instead we need to add another metric aggregation to the Elasticsearch request which calculates the global average of hats sold at stores (https://stackoverflow.com/a/69668089/5938725). We want to do something like this:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "title": "Hat Sales",
  "data": {
    "url": {
      "index": "hat-sales",
      "body": {
        "size": 0,
        "query": {"match_all": {}},
        "aggs": {
          "stores": {"terms": {"field": "store.keyword", "size": 3}},
          "global": {
            "filters": {
              "filters": {"all": {"exists": {"field": "store.keyword"}}}
            },
            "aggs": {
              "count": {"value_count": {"field": "store.keyword"}},
              "unique_count": {"cardinality": {"field": "store.keyword"}},
              "global_average": {
                "bucket_script": {
                  "buckets_path": {"total": "count", "unique": "unique_count"},
                  "script": "params.total / params.unique"
                }
              }
            }
          }
        }
      }
    },
    "format": {"property": "aggregations.stores.buckets"}
  },
  "transform": [
    {"calculate": "datum.key", "as": "store"},
    {"calculate": "datum.doc_count", "as": "count"}
  ],
  "layer": [
    {
      "name": "Sales of top 3 stores",
      "mark": "bar",
      "encoding": {
        "x": {"type": "nominal", "field": "store", "sort": "-y"},
        "y": {"type": "quantitative", "field": "count"}
      }
    },
    {
      "name": "Average number of sales over all stores",
      "mark": {"type": "rule", "color": "red"},
      ??????????????????
    }
  ]
}

But how can I have one layer use the data from "aggregations.stores.buckets" and another layer use data from "aggregations.global.buckets" in order to access that global_average?


Solution

  • I did get it to work using this:

    {
      "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
      "description": "A simple bar chart with embedded data.",
      "data": {
        "url": {
          "index": "hat-sales",
          "body": {
            "size": 0,
            "query": {"match_all": {}},
            "aggs": {
              "stores": {"terms": {"field": "store.keyword", "size": 3}},
              "global": {
                "filters": {
                  "filters": {"all": {"exists": {"field": "store.keyword"}}}
                },
                "aggs": {
                  "count": {"value_count": {"field": "store.keyword"}},
                  "unique_count": {"cardinality": {"field": "store.keyword"}},
                  "global_average": {
                    "bucket_script": {
                      "buckets_path": {"total": "count", "unique": "unique_count"},
                      "script": "params.total / params.unique"
                    }
                  }
                }
              }
            }
          }
        }
      },
      "transform": [
        {"flatten": ["aggregations.stores.buckets"]},
        {"calculate": "datum['aggregations.stores.buckets'].key", "as": "store"},
        {
          "calculate": "datum['aggregations.stores.buckets'].doc_count",
          "as": "count"
        },
        {
          "calculate": "datum.aggregations.global.buckets.all.global_average.value",
          "as": "global_average"
        }
      ],
      "layer": [
        {
          "name": "Sales of top 3 stores",
          "mark": "bar",
          "encoding": {
            "x": {"type": "nominal", "field": "store", "sort": "-y"},
            "y": {"type": "quantitative", "field": "count"}
          }
        },
        {
          "name": "Global Average",
          "mark": {"type": "rule", "color": "red"},
          "encoding": {"y": {"field": "global_average", "type": "quantitative"}}
        }
      ]
    }
    

    It is less than ideal because the flatten transforms makes it so that the individual datum objects are somewhat bigger. It is also confusing that once you flatten aggregations.stores.buckets, that becomes the literal name-- "aggregations.stores.buckets"-- of a field of datum, which must be accessed via square bracket notation because it contains periods. enter image description here