Storing JSON array string elasticsearch Bug

I am observing some strange behavior coming out of Elasticsearch 5.2 and it's impossible to debug-- as there are no errors thrown nor am I able to find similar issues/documentation online.

I'm storing a JSON array as a "string" in elasticsearch (using python's json.dumps()) -- long story short, I have to do it this way. However, when I do a DSL query, only the JSON arrays (stored as a singular string) containing 1 object are shown. If more than 1, then it just returns an empty bucket 0 objects. I'm storing them in a field called "metadata".

I'm very confused why only a subset of the data is shown, and other data (with more than 1 object in json array) is ignored. The data is encoded as string. I know for a fact the data stored in index. I can see it in kibana "discovery" -- as I can see large JSON strings with multiple objects.

Example 1 (JSON String w/ 1 object):

[{"score": 0.8829717636108398, "height": 0.875460147857666, "width": 0.3455989360809326, "y": 0.08105117082595825, "x": 0.5616265535354614, "note": "box1"}]

Example 2:

[{"score": 0.8829717636108398, "height": 0.875460147857666, "width": 0.3455989360809326, "y": 0.08105117082595825, "x": 0.5616265535354614, "note": "box1"}, {"score": 0.6821991136108398, "height": 0.875460147857666, "width": 0.3455989360809326, "y": 0.08105117082595825, "x": 0.5616265535354614, "note": "box2"}]

Here is my query:

{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "analyze_wildcard": true,
            "query": "*"
          }
        },
        {
          "range": {
            "created_at": {
              "gte": 1508012482796,
              "lte": 1508014282797,
              "format": "epoch_millis"
            }
          }
        }
      ],
      "must_not": []
    }
  },
  "size": 0,
  "_source": {
    "excludes": []
  },
  "aggs": {
    "5": {
      "terms": {
        "field": "metadata.keyword",
        "size": 31,
        "order": {
          "_count": "desc"
        }
      }
    }
  }
}

This query only returns strings with 1 object. See below:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4214,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "5": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 35,
      "buckets": [
        {
          "key": "[]",
          "doc_count": 102
        },
        {
          "key": "{}",
          "doc_count": 8
        },
        {
          "key": "[{\"score\": 0.9015679955482483, \"height\": 0.8632315695285797, \"width\": 0.343660831451416, \"y\": 0.08102986216545105, \"x\": 0.5559845566749573, \"note\": \"box11\"}]",
          "doc_count": 6
        },
        {
          "key": "[{\"score\": 0.6365205645561218, \"height\": 0.9410756528377533, \"width\": 0.97696852684021, \"y\": 0.04701271653175354, \"x\": 0.013666868209838867, \"note\": \"box17\"}]",
          "doc_count": 4
        },
...
}

As observed, only data with JSON strings with 1 objects (i.e. [{..}]) are returned/visible. It's completely ignoring the strings with multiple objects (i.e. [{...},{...}]).

More Clarifications:

It's using the default mappings
I am able to get the JSON string(regardless of the number of objects) when queried by document id, or using "match" by exact field values)

Solution

If you're using the default mapping, this is most probably because your keyword mapping has an ignore_above: 256 settings and looks like this:

{
  "mappings": {
    "my_type": {
      "properties": {
        "metadata": {
          "type":  "keyword",
          "ignore_above": 256
        }
      }
    }
  }
}

You can increase that limit in order to index your JSON strings longer than 256 characters.