I am currently working with Kibana on more than 6 billion documents and trying to get a sample per 'index', where each index corresponds to the day its documents were collected.
I connect and run the query with the code below:
from elasticsearch import Elasticsearch
es = Elasticsearch(['https://user:secret@localhost:xxx'])
res = es.search(body=body1)
print(f"Got {res['hits']['total']} Hits:")
When I use the body below, I get all 6 billion documents:
body1 = {
    "query": {"match_all": {}}
}
However, when I add aggregations to the request, I get the error: RequestError(400, 'parsing exception', 'Unknown key for a START_OBJECT in [my_agg].')
body0 = {
    "query": {"match_all": {}},
    "size": 0,
    "aggs": {
        "my_unbiased_sample": {
            "diversified_sampler": {
                "max_docs_per_value": 3,
                "field": "_index"
            }
        }
    }, "my_agg": {
        "terms": {
            "field": "_index"
        }
    }
}
I believe the problem lies with my second aggregation and not with the diversified sampler. I only want the output of the diversified sampler, but I seem to be forced to add a second aggregation.
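You were almost there. You just need to fix the nesting: "my_agg" ended up as a top-level key of the request body, which is why the parser rejects it. It has to be a sub-aggregation of "my_unbiased_sample", inside that aggregation's own "aggs" block: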
{
    "query": {
        "match_all": {}
    },
    "size": 0,
    "aggs": {
        "my_unbiased_sample": {
            "diversified_sampler": {
                "max_docs_per_value": 3,
                "field": "_index"
            },
            "aggs": {
                "my_agg": {
                    "terms": {
                        "field": "_index"
                    }
                }
            }
        }
    }
}
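For completeness, here is a minimal Python sketch of how the corrected body could be built and how the result might be read back. The helper names build_sample_body and summarize_sample are made up for illustration, and the response layout (a "doc_count" plus a nested "my_agg" with "buckets" under the sampler's name) is an assumption based on the usual shape of bucket aggregation responses, so check it against what your cluster actually returns.

```python
def build_sample_body(field="_index", max_docs_per_value=3):
    """Build a search body with the terms aggregation nested under the sampler."""
    return {
        "query": {"match_all": {}},
        "size": 0,  # no hits needed, only aggregation results
        "aggs": {
            "my_unbiased_sample": {
                "diversified_sampler": {
                    "max_docs_per_value": max_docs_per_value,
                    "field": field,
                },
                # Sub-aggregations go in an "aggs" key next to the sampler,
                # not at the top level of the request body.
                "aggs": {
                    "my_agg": {"terms": {"field": field}}
                },
            }
        },
    }


def summarize_sample(res):
    """Return (sampled_doc_count, {bucket_key: doc_count}) from a search response.

    Assumes the response nests "my_agg" under "my_unbiased_sample".
    """
    sample = res["aggregations"]["my_unbiased_sample"]
    buckets = sample["my_agg"]["buckets"]
    return sample["doc_count"], {b["key"]: b["doc_count"] for b in buckets}


# Assumed usage with the client from the question (not run here):
#   res = es.search(body=build_sample_body())
#   sampled, per_index = summarize_sample(res)
```

Building the body in a function keeps the nesting in one place, so changing the field or max_docs_per_value cannot accidentally move "my_agg" back to the top level.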