Tags: elasticsearch, elasticsearch-painless

Is it possible to transform JSON data in an Elasticsearch Painless script, and perform further operations on it?


We have a large corpus of JSON-formatted documents to search through to find patterns and historical trends. Elasticsearch seems like the perfect fit for this problem. The first trick is that the documents are collections of tens of thousands of "nested" documents (with a header). The second trick is that these nested documents represent data with varying types.

In order to accommodate this, all the value fields have been "encoded" as an array of strings, so a single integer value has been stored in the JSON as "[\"1\"]", and a table of floats is flattened to "[\"123.45\",\"678.9\",...]" and so on. (We also have arrays of strings, which don't need converting.) While this is awkward, I would have thought this would be a good compromise, given the way everything else involved in Elasticsearch seems to work.

The particular problem here is that these stored data values might represent a bitfield, from which we may need to inspect the state of one bit. Since this field will have been stored as a single-element string array, like "[\"14657\"]", we need to convert that to a single integer, and then bit-shift it to reach the desired bit (or apply a mask, if such a function is available).

With Elasticsearch, I see that I can embed "Painless" scripts, but examples vary, and I haven't been able to find one that shows how I can convert the arbitrary-length string-array data field to appropriate types for further comparison. Here's my query script as it stands.

{
  "_source" : false,
  "from" : 0, "size" : 10,
  "query": {
    "nested": {
      "path": "Variables",
      "query": {
        "bool": {
          "must": {
            "match": {"Variables.Designation": "Big_Long_Variable_Name"}
          },
          "must_not": {
            "match": {"Variables.Data": "[0]"}
          },
          "filter": {
            "script": {
              "script": {
                "source": """
                  def vals = doc['Variables.Data'];
                  return vals[0] != params.setting;
                """,
                "params": {
                  "setting": 3
                }
              }
            }
          }
        }
      },
      "inner_hits": {
        "_source": "Variables.Data"
      }
    }
  }
}

I need to somehow transform the vals variable to an array of ints, pick off the first value, do some bit operations, and make a comparison to return true or false. In this example, I'm hoping to be able to set "setting" equal to the bit position I want to check for on/off.
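To make the goal concrete, here is the per-value check I'm after sketched in plain Java (Painless uses a Java-like syntax, so I'd hope the core expression carries over; the value 14657 is just the example bitfield from above, and "setting" is the bit position):

```java
public class BitCheck {
    // Test whether bit `setting` is set in `value`:
    // shift the desired bit down to position 0, then mask it off.
    static boolean bitIsSet(int value, int setting) {
        return ((value >> setting) & 1) == 1;
    }

    public static void main(String[] args) {
        int value = 14657;                       // e.g. parsed from "[\"14657\"]"
        System.out.println(bitIsSet(value, 3));  // prints false (bit 3 of 14657 is 0)
        System.out.println(bitIsSet(value, 6));  // prints true  (bit 6 is set)
    }
}
```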

I've already been through the exercise with Elasticsearch in finding out that I needed to make my Variables.Data field a keyword so I could search on specific values in it. I realize that this is getting away from the intent of Elasticsearch, but I still think this might be the best solution, for other reasons. I created a new index, and reimported my test documents, and the index size went up about 30%. That's a compromise I'm willing to make, if I can get this to work.

What tools do I have in Painless to make this work? (Or, am I crazy to try to do this with this tool?)


Solution

  • I would suggest encoding your data in Elasticsearch-provided types wherever possible (and even when not obviously needed) to get the most out of Painless. For instance, you can encode the bitfields as arrays of 1s and 0s for easier operations in Painless.

    Painless, in my opinion, is still primitive. It's hard to debug. It's hard to read. It's hard to maintain. And, it's a horrible idea to have large functions in Painless.

    To answer your question, you'd basically need to parse the array string in Painless into one of the available data types before doing the comparison you want. For example, for a list you'd use something like the split function, and then manually cast each item of the result to int, float, string, etc.
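    A sketch of that parsing in plain Java (Painless is Java-like, so the same chain of replace/split/parse should translate; one caveat is that regex-based split is disabled in Painless unless `script.painless.regex.enabled` is set, in which case `splitOnToken` is an alternative — the encoded format here is taken from the question):

```java
import java.util.ArrayList;
import java.util.List;

public class ParseEncoded {
    // Decode an encoded field like "[\"1\",\"2\"]" into a list of ints:
    // strip the brackets and quotes, split on commas, parse each piece.
    static List<Integer> decodeInts(String encoded) {
        String stripped = encoded.replace("[", "")
                                 .replace("]", "")
                                 .replace("\"", "");
        List<Integer> out = new ArrayList<>();
        for (String piece : stripped.split(",")) {
            out.add(Integer.parseInt(piece.trim()));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(decodeInts("[\"14657\"]"));         // prints [14657]
        System.out.println(decodeInts("[\"1\",\"2\",\"3\"]")); // prints [1, 2, 3]
    }
}
```

    The decoded values can then feed the bit test or comparison inside the script filter.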

    Use the execute API to test small bits before adding this to your scripted field:

    POST /_scripts/painless/_execute
    {
      "script": {
        "source": """
        ArrayList arr = []; //to start with
        // use arr.add(INDEX, VALUE) to add after parsing
        """,
        "params": {
          "foo": 100.0,
          "bar": 1000.0
        }
      }
    }
    

    On the other hand, if you save your data in Elasticsearch-provided data types (note that Elasticsearch supports saving lists inside documents), then this task becomes far easier in Painless.

    For example, instead of storing my_doc.foo = "[\"123.45\",\"678.9\",...]" as a string to be parsed later, why not save it as a native list of floats, like my_doc.foo = [123.45, 678.9, ...]?

    This way, you avoid the unnecessary Painless code required to parse the encoded strings.
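    To illustrate the difference in plain Java (standing in for Painless; `my_doc.foo` above is a hypothetical field name), the native form skips the parsing step entirely:

```java
import java.util.Arrays;
import java.util.List;

public class NativeVsEncoded {
    public static void main(String[] args) {
        // Encoded form: every read pays a parsing cost first.
        String encoded = "[\"123.45\",\"678.9\"]";
        String[] parts = encoded.replace("[", "")
                                .replace("]", "")
                                .replace("\"", "")
                                .split(",");
        double firstFromEncoded = Double.parseDouble(parts[0]);

        // Native form: the values are already numbers.
        List<Double> nativeList = Arrays.asList(123.45, 678.9);
        double firstFromNative = nativeList.get(0);

        System.out.println(firstFromEncoded == firstFromNative); // prints true
    }
}
```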