Tags: regex, elasticsearch, regex-group, elasticsearch-query, elasticsearch-mapping

How to search with regex for fields containing whitespace-separated segments in Elasticsearch


I have a document which contains a field called info_list, which is basically a string made up of 9 whitespace-separated segments.

The mapping of the field is:

"info_list": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }

And the source document looks like this:

"_source": {
      "id": "1234",
      "date": "1614556800000",
      "info_list": [
        "1234 2D 5678 8765 5678 1111 2222 3333 1"
      ]
    }

The info list always consists of 9 segments. For the sake of this question, let's call those segments a, b, c, d, e, f, g, h, i:

info_list = a + ' ' + b + ' ' + c + ' ' + d + ' ' + e + ' ' + f + ' ' + g + ' ' + h + ' ' + i

Now suppose I want to search for c with a value of 5678. The current implementation uses a match_phrase query, something like this:

GET test/_search
{
  "query": {
    "match_phrase": {
      "info_list": "5678"
    }
  }
}

The issue with the above approach is that even though I want results where c = 5678, any segment of the info_list string containing 5678 will match, resulting in wrong search results.

I tried using a regex query, something like:

GET /test/_search
{
  "query": {
    "query_string": {
      "fields": ["info_list"],
      "query": ".* .* 5678 .*"
    }
  }
}

But this doesn't seem to work. Should I change the mapping of the field? Any help or suggestions would be appreciated, since I am new to Elasticsearch.


Solution

    Fixing the regex

    You've got to let the regex engine know exactly how many preceding groups there will be. On top of that, you'll need to use the .keyword field because the standard analyzer that's applied on text fields by default will have split the original string by whitespace and converted each token to lowercase -- both of which should be prevented if you aim to work with capture groups.
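
    If you want to see this for yourself, the _analyze API shows what the standard analyzer emits for the sample string from the question:

    POST test/_analyze
    {
      "analyzer": "standard",
      "text": "1234 2D 5678 8765 5678 1111 2222 3333 1"
    }

    The response contains the individual tokens 1234, 2d, 5678, ... (note the lowercasing of 2D) -- not the original string, which is why positional matching against the text field falls apart.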

    Having said that, here's a working regexp query:

    GET /test/_search
    {
      "query": {
        "regexp": {
          "info_list.keyword": "( ?[a-zA-Z0-9]+){2} 5678 .*"
        }
      }
    }
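
    Here, ( ?[a-zA-Z0-9]+){2} consumes exactly two whitespace-separated segments (a and b), which pins 5678 to the third position (c), and the trailing .* permits anything afterwards. Keep in mind that Lucene regular expressions are anchored by default, i.e. the pattern must match the entire string.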
    

    Extracting the groups before ingestion

    Should I change the mapping of the field?

    I'd say go for it. When you know which group you'll be targeting, you should, ideally, extract the groups before you ingest the documents.

    See, what I'd do in your case is the following:

    • Preserve the original info_list as a keyword for consistency
    • Extract the groups in the programming language of your choice and annotate them with keys a to i (analogously to the way you naturally think about said groups) -- or let Elasticsearch do the splitting for you at ingest time, as shown in the pipeline sketch after the steps below.
    • Store them inside a nested field in order to guarantee that the connections between the keys and the values aren't lost due to array flattening.

    In concrete terms:

    1. Set up a mapping
    PUT extracted-groups-index
    {
      "mappings": {
        "properties": {
          "info_list": {
            "type": "keyword"
          }, 
          "info_list_groups": {
            "type": "nested",
            "properties": {
              "group_key": {
                "type": "keyword"
              },
              "value": {
                "type": "keyword"
              }
            }
          }
        }
      }
    }
    
    2. Ingest the doc(s)
    POST extracted-groups-index/_doc
    {
      "info_list": "1234 2D 5678 8765 5678 1111 2222 3333 1",
      "info_list_groups": [
        {
          "group_key": "a",
          "value": "1234"
        },
        {
          "group_key": "b",
          "value": "2D"
        },
        {
          "group_key": "c",
          "value": "5678"
        },
        { ... } // omitted for brevity
      ]
    }
    
    3. Leverage a pair of nested term queries:
    POST extracted-groups-index/_search
    {
      "query": {
        "nested": {
          "path": "info_list_groups",
          "query": {
            "bool": {
              "must": [
                {
                  "term": {
                    "info_list_groups.group_key": "c"
                  }
                },
                {
                  "term": {
                    "info_list_groups.value": "5678"
                  }
                }
              ]
            }
          }
        }
      }
    }
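
    If you'd rather not split the string in your application code, the same extraction can happen inside Elasticsearch itself at ingest time. Here's a minimal sketch using an ingest pipeline with a Painless script processor (the pipeline name and script are illustrative, not part of the original setup; splitOnToken requires a reasonably recent Elasticsearch version):

    PUT _ingest/pipeline/split-info-list
    {
      "description": "Split info_list into keyed groups a through i",
      "processors": [
        {
          "script": {
            "source": """
              // info_list may arrive as a single string or a one-element array
              def raw = ctx.info_list instanceof List ? ctx.info_list[0] : ctx.info_list;
              String[] keys = new String[] {'a','b','c','d','e','f','g','h','i'};
              def parts = raw.splitOnToken(' ');
              def groups = [];
              for (int i = 0; i < parts.length && i < keys.length; i++) {
                groups.add(['group_key': keys[i], 'value': parts[i]]);
              }
              ctx.info_list_groups = groups;
            """
          }
        }
      ]
    }

    Indexing via POST extracted-groups-index/_doc?pipeline=split-info-list then populates info_list_groups automatically, and the nested queries from step 3 work unchanged.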
    

    Harnessing the full power of Elasticsearch 🚀

    The downside of the nested approach is that it'll increase your index size. Plus, the queries tend to get quite verbose and confusing. If you don't want to go that route, you can leverage what's called a custom analyzer.

    Such an analyzer is typically composed of:

    • a tokenizer (which receives character streams and outputs a stream of tokens -- usually words)
    • and a few token filters whose role it is to mold the tokens into the desired form.

    In concrete terms, the aim here is to:

    1. Take in the string 1234 2D 5678 8765 5678 1111 2222 3333 1 as a whole

    2. Locate the individual groups separated by whitespace

      --> (1234) (2D) (5678) (8765) (5678) (1111) (2222) (3333) (1)

    3. Annotate each group with its alphabetical index

      --> a:1234 b:2D c:5678 d:8765 e:5678 f:1111 g:2222 h:3333 i:1

    4. And finally split the resulting string by whitespace in order to use queries like a:1234 and c:5678

    All of this can be achieved through a combination of the "noop" keyword tokenizer, and pattern_replace + pattern_capture filters:

    PUT power-of-patterns
    {
      "mappings": {
        "properties": {
          "info_list": {
            "type": "text",
            "fields": {
              "annotated_groups": {
                "type": "text",
                "analyzer": "info_list_analyzer"
              }
            }
          }
        }
      },
      "settings": {
        "analysis": {
          "analyzer": {
            "info_list_analyzer": {
              "type": "custom",
              "tokenizer": "keyword",
              "filter": ["pattern_grouper", "pattern_splitter"]
            }
          },
          "filter": {
            "pattern_grouper": {
              "type": "pattern_replace",
              "pattern": "((?<a>(?:\b ?[a-zA-Z0-9]+){0}.*?([a-zA-Z0-9]+) ?))((?<b>(?:\b ?[a-zA-Z0-9]+){0}.*?([a-zA-Z0-9]+) ?))((?<c>(?:\b ?[a-zA-Z0-9]+){0}.*?([a-zA-Z0-9]+) ?))((?<d>(?:\b ?[a-zA-Z0-9]+){0}.*?([a-zA-Z0-9]+) ?))((?<e>(?:\b ?[a-zA-Z0-9]+){0}.*?([a-zA-Z0-9]+) ?))((?<f>(?:\b ?[a-zA-Z0-9]+){0}.*?([a-zA-Z0-9]+) ?))((?<g>(?:\b ?[a-zA-Z0-9]+){0}.*?([a-zA-Z0-9]+) ?))((?<h>(?:\b ?[a-zA-Z0-9]+){0}.*?([a-zA-Z0-9]+) ?))((?<i>(?:\b ?[a-zA-Z0-9]+){0}.*?([a-zA-Z0-9]+) ?))",
              "replacement": "a:${a}b:${b}c:${c}d:${d}e:${e}f:${f}g:${g}h:${h}i:${i}"
            },
            "pattern_splitter": {
              "type" : "pattern_capture",
               "preserve_original" : true,
               "patterns" : [
                  "([a-i]\\:[a-zA-Z0-9]+)"
               ]
            }
          }
        }
      }
    }
    

    Note that the friendly-looking regex from above is nothing more than a repetitive named group catcher.
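
    Before ingesting anything, you can sanity-check the analysis chain with the _analyze API:

    POST power-of-patterns/_analyze
    {
      "analyzer": "info_list_analyzer",
      "text": "1234 2D 5678 8765 5678 1111 2222 3333 1"
    }

    The token stream should contain the annotated terms (a:1234, b:2D, c:5678, and so on) alongside the preserved original.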

    After setting up the mapping, you can ingest the document(s):

    POST power-of-patterns/_doc
    {
      "info_list": [
        "1234 2D 5678 8765 5678 1111 2222 3333 1"
      ]
    }
    

    And then search for the desired segment in a nice, human-readable form:

    POST power-of-patterns/_search
    {
      "query": {
        "term": {
          "info_list.annotated_groups": "c:5678"
        }
      }
    }
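
    Note that a term query is a good fit here because it skips analysis and looks up the token c:5678 verbatim -- which is exactly one of the tokens the custom analyzer produced at index time.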