I am observing some strange behavior coming out of Elasticsearch 5.2 and it's impossible to debug-- as there are no errors thrown nor am I able to find similar issues/documentation online.
I'm storing a JSON array as a "string" in elasticsearch (using python's json.dumps()) -- long story short, I have to do it this way. However, when I do a DSL query, only the JSON arrays (stored as a singular string) containing 1 object are shown. If more than 1, then it just returns an empty bucket 0 objects. I'm storing them in a field called "metadata".
I'm very confused why only a subset of the data is shown, and other data (with more than 1 object in json array) is ignored. The data is encoded as string. I know for a fact the data stored in index. I can see it in kibana "discovery" -- as I can see large JSON strings with multiple objects.
Example 1 (JSON String w/ 1 object):
[{"score": 0.8829717636108398, "height": 0.875460147857666, "width": 0.3455989360809326, "y": 0.08105117082595825, "x": 0.5616265535354614, "note": "box1"}]
Example 2:
[{"score": 0.8829717636108398, "height": 0.875460147857666, "width": 0.3455989360809326, "y": 0.08105117082595825, "x": 0.5616265535354614, "note": "box1"}, {"score": 0.6821991136108398, "height": 0.875460147857666, "width": 0.3455989360809326, "y": 0.08105117082595825, "x": 0.5616265535354614, "note": "box2"}]
Here is my query:
{
"query": {
"bool": {
"must": [
{
"query_string": {
"analyze_wildcard": true,
"query": "*"
}
},
{
"range": {
"created_at": {
"gte": 1508012482796,
"lte": 1508014282797,
"format": "epoch_millis"
}
}
}
],
"must_not": []
}
},
"size": 0,
"_source": {
"excludes": []
},
"aggs": {
"5": {
"terms": {
"field": "metadata.keyword",
"size": 31,
"order": {
"_count": "desc"
}
}
}
}
}
This query only returns strings with 1 object. See below:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4214,
"max_score": 0,
"hits": []
},
"aggregations": {
"5": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 35,
"buckets": [
{
"key": "[]",
"doc_count": 102
},
{
"key": "{}",
"doc_count": 8
},
{
"key": "[{\"score\": 0.9015679955482483, \"height\": 0.8632315695285797, \"width\": 0.343660831451416, \"y\": 0.08102986216545105, \"x\": 0.5559845566749573, \"note\": \"box11\"}]",
"doc_count": 6
},
{
"key": "[{\"score\": 0.6365205645561218, \"height\": 0.9410756528377533, \"width\": 0.97696852684021, \"y\": 0.04701271653175354, \"x\": 0.013666868209838867, \"note\": \"box17\"}]",
"doc_count": 4
},
...
}
As observed, only data with JSON strings with 1 objects (i.e. [{..}]) are returned/visible. It's completely ignoring the strings with multiple objects (i.e. [{...},{...}]).
More Clarifications:
If you're using the default mapping, this is most probably because your keyword mapping has an ignore_above: 256
settings and looks like this:
{
"mappings": {
"my_type": {
"properties": {
"metadata": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
You can increase that limit in order to index your JSON strings longer than 256 characters.