Search code examples
elasticsearchnulltri-state-logic

How can we deal with NULL values that have specific meanings?


Issue

I'm trying to save a boolean value to elasticsearch, but it is specifically valid for it to be NULL. A sort of don't-care in this case.

There seem to be several options, but it is not completely clear what the best would be.

We are using ElasticSearch version 5.0.2

Option 1

The trivial one would be to save it as a boolean with NULL values. Those would be seen as 'missing' by ES.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "my_boolean": { "type": "boolean"}
      }
    }
  }
}

PUT my_index/my_type/1
{"my_boolean": true}

PUT my_index/my_type/2
{"my_boolean": false}

PUT my_index/my_type/3
{"my_boolean": null}

This has several issues, one of them being aggregations. There doesn't seem to be an easy way to get the values true, false and NULL in an aggregation.

The missing feature is known to me, so I know I can do the following:

GET my_index/_search
{
  "size":0,
  "aggregations": {
    "my_boolean": {
      "terms": {
        "field": "my_boolean"
      }
    },
    "missing_fields": {
          "missing" : {
            "field": "my_boolean"
          }
    }
  }
}

But this will result in a bucket with 2 values (true/false) and a separate count for the missing documents. That looks like it will cause problems.

Option 2

Another option is to actually give the NULL a value, as described in the manual. The problem is that the value needs to be the correct type, and there is nothing but true and false as a boolean.

The null_value needs to be the same datatype as the field. For instance, a long field cannot have a string null_value.

This means we can use a different type that supports more then 2 values, e.g. integer, but that would be in my head the same as saying: lets map it as integer, and define 1 as true, 2 as false and 3 as null. This would work, but we'd have an implicit mapping that all should know about. (All producers/consumers/whatyamahaveits).

Option 3

A final version would be to try and script our way out of this problem.

GET my_index/_search
{
  "size":0,
  "aggregations": {

    "my_boolean": {
      "terms": {
       "script" : {
        "inline": "if(doc['my_boolean'].length === 1) { if(doc['my_boolean'].value === true){ return 1;} else {return 2;} } else { return 3;}"
        }
      }
    }
  }
}

Now we do get the right results in somewhat sane buckets.

"aggregations": {
"my_boolean": {
  "doc_count_error_upper_bound": 0,
  "sum_other_doc_count": 0,
  "buckets": [
    {
      "key": "1",
      "doc_count": 1
    },
    {
      "key": "2",
      "doc_count": 1
    },
    {
      "key": "3",
      "doc_count": 1
    }
  ]
}
}

Note that we still have an implicit mapping with the keys here, so this seems to have some of the same issues that mapping it as an integer has. But still, your datatype is what it should be, so that might be something. Note that we cannot have a bucket with 'null' as key. We can call them "true", "false" and "null" (strings) ofcourse, but this is the same situation, but hidden even more.

Question

what is the best way to deal with this null-problem? (Or maybe we should call it a 'tri-state-boolean-problem'?)

To clarify: we fear that later on a 'non-standard' value might cause problems. The first we saw was the bucketing which we might be able to fix with above script solution, but maybe we run into other issues later. So we are looking for the best-practice of saving this type of data, rather then a quick solution for a specific problem.


Solution

  • In the end we went for mapping the various states to a byte.

    A missing value only works if the type is capable of having that value, so we need the mapping anyway, so we add the extra number during insertion.

    So instead of a boolean with true, false and null values, or an integer with 1, 2 and null (with missing = -1) values, we use a byte with 1, 2 and 3, meaning (in random order) true, false and null.