I am having difficulty creating a query that matches only whole phrases while also allowing wildcards.
Basically I have a field that contains a string (it is actually a list of strings, but for simplicity I am skipping that), which can contain whitespace or be null; let's call it "color".
For example:
{
  ...
  "color": "Dull carmine pink"
  ...
}
My queries need to be able to match whole phrases (with or without wildcards) and to include or exclude null values.
I have been banging my head against the wall for a few days with this and I have tried almost every type of query I could think of.
I have only managed to make it work partially with a span_near query with the help of this topic.
So basically I can now:
search for a whole phrase, with or without wildcards, like this:
{
  "span_near": {
    "clauses": [
      { "span_term": { "color": "dull" } },
      { "span_term": { "color": "carmine" } },
      { "span_multi": { "match": { "wildcard": { "color": "p*" } } } }
    ],
    "slop": 0,
    "in_order": true
  }
}
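For completeness, here is a sketch of that clause wrapped in a full search request body (the field name "color" is from the example above):

```json
{
  "query": {
    "span_near": {
      "clauses": [
        { "span_term": { "color": "dull" } },
        { "span_term": { "color": "carmine" } },
        { "span_multi": { "match": { "wildcard": { "color": "p*" } } } }
      ],
      "slop": 0,
      "in_order": true
    }
  }
}
```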
search for null values (inclusive and exclusive) with simple must/must_not queries like this:
{
  "must" / "must_not": { "exists": { "field": "color" } }
}
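As a full request, a sketch that returns only documents where "color" is missing (i.e. null or absent) could look like this; swapping must_not for must inverts it:

```json
{
  "query": {
    "bool": {
      "must_not": [
        { "exists": { "field": "color" } }
      ]
    }
  }
}
```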
The problem: I cannot find a way to make an exclusive span query. The only approach I can find is span_not, but it requires both the include and exclude clauses, and I only want to exclude some phrases; all other documents must be returned. Is there some analog of the "match_all": {} query that can work inside a span_not's include clause? Or perhaps an entirely new, more elegant solution?
I found the solution a month ago, but I forgot to post it here. I do not have an example at hand, but I will try to explain it.
The problem was that the fields I was trying to query were analyzed by Elasticsearch at index time. The analyzer in question was splitting them on whitespace, among other things. The solution is one of the following two:
1. If you do not use a custom mapping for the index
(meaning you let Elasticsearch dynamically create the appropriate mapping for the field when you first indexed it).
In this case Elasticsearch automatically creates a subfield of the text field called "keyword". This subfield is of type keyword, which means the value is indexed verbatim, without being analyzed.
Which means that queries like:
{
  "query": {
    "bool": {
      "must": [        // or "must_not"
        {
          "match": {
            "user.keyword": "Kim Chy"
          }
        }
      ]
    }
  }
}
and
{
  "query": {
    "bool": {
      "must": [        // or "must_not"
        {
          "wildcard": {
            "user.keyword": "Kim*y"
          }
        }
      ]
    }
  }
}
should work as expected.
However, with the default mapping the keyword subfield will be case-sensitive. For it to be case-insensitive as well, you will need to create a custom mapping that applies a lowercase (or uppercase) normalizer to the keyword field; the normalizer is then applied both at index time and to the query term at search time.
2. If you use a custom mapping
Basically the same as above, except that you have to add the keyword subfield (or a separate field) manually, of type keyword, and optionally attach a normalizer to it so that matching is case-insensitive.
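As a sketch of such a mapping (the index name my-index, the normalizer name lowercase_normalizer, and the field color are placeholders; adapt them to your own index):

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "color": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "normalizer": "lowercase_normalizer"
          }
        }
      }
    }
  }
}
```

With this mapping, term-level queries against color.keyword (match, wildcard, etc.) are compared case-insensitively.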
P.S. As far as I am aware, changing the mapping of an existing field is not possible in Elasticsearch. This means that you will have to create a new index with the appropriate mapping and then reindex your data into the new index.
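The reindex step can be done with the Reindex API; a minimal sketch (old-index and new-index are placeholder names):

```json
POST /_reindex
{
  "source": { "index": "old-index" },
  "dest":   { "index": "new-index" }
}
```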