Search code examples
elasticsearchkibanaelasticsearch-dsl

Storing JSON array string elasticsearch Bug


I am observing some strange behavior coming out of Elasticsearch 5.2 and it's impossible to debug-- as there are no errors thrown nor am I able to find similar issues/documentation online.

I'm storing a JSON array as a "string" in elasticsearch (using python's json.dumps()) -- long story short, I have to do it this way. However, when I do a DSL query, only the JSON arrays (stored as a singular string) containing 1 object are shown. If more than 1, then it just returns an empty bucket 0 objects. I'm storing them in a field called "metadata".

I'm very confused why only a subset of the data is shown, and other data (with more than 1 object in json array) is ignored. The data is encoded as string. I know for a fact the data stored in index. I can see it in kibana "discovery" -- as I can see large JSON strings with multiple objects.

Example 1 (JSON String w/ 1 object):

[{"score": 0.8829717636108398, "height": 0.875460147857666, "width": 0.3455989360809326, "y": 0.08105117082595825, "x": 0.5616265535354614, "note": "box1"}]

Example 2:

[{"score": 0.8829717636108398, "height": 0.875460147857666, "width": 0.3455989360809326, "y": 0.08105117082595825, "x": 0.5616265535354614, "note": "box1"}, {"score": 0.6821991136108398, "height": 0.875460147857666, "width": 0.3455989360809326, "y": 0.08105117082595825, "x": 0.5616265535354614, "note": "box2"}]

Here is my query:

{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "analyze_wildcard": true,
            "query": "*"
          }
        },
        {
          "range": {
            "created_at": {
              "gte": 1508012482796,
              "lte": 1508014282797,
              "format": "epoch_millis"
            }
          }
        }
      ],
      "must_not": []
    }
  },
  "size": 0,
  "_source": {
    "excludes": []
  },
  "aggs": {
    "5": {
      "terms": {
        "field": "metadata.keyword",
        "size": 31,
        "order": {
          "_count": "desc"
        }
      }
    }
  }
}

This query only returns strings with 1 object. See below:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4214,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "5": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 35,
      "buckets": [
        {
          "key": "[]",
          "doc_count": 102
        },
        {
          "key": "{}",
          "doc_count": 8
        },
        {
          "key": "[{\"score\": 0.9015679955482483, \"height\": 0.8632315695285797, \"width\": 0.343660831451416, \"y\": 0.08102986216545105, \"x\": 0.5559845566749573, \"note\": \"box11\"}]",
          "doc_count": 6
        },
        {
          "key": "[{\"score\": 0.6365205645561218, \"height\": 0.9410756528377533, \"width\": 0.97696852684021, \"y\": 0.04701271653175354, \"x\": 0.013666868209838867, \"note\": \"box17\"}]",
          "doc_count": 4
        },
...
}

As observed, only data with JSON strings with 1 objects (i.e. [{..}]) are returned/visible. It's completely ignoring the strings with multiple objects (i.e. [{...},{...}]).

More Clarifications:

  • It's using the default mappings
  • I am able to get the JSON string(regardless of the number of objects) when queried by document id, or using "match" by exact field values)

Solution

  • If you're using the default mapping, this is most probably because your keyword mapping has an ignore_above: 256 settings and looks like this:

    {
      "mappings": {
        "my_type": {
          "properties": {
            "metadata": {
              "type":  "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
    

    You can increase that limit in order to index your JSON strings longer than 256 characters.