Regexp filter in Opensearch (or Elasticsearch)


Here is a sample json document indexed in Opensearch:

{
  "_index": "filebeat-7.12.1-2024.08.28",
  "_type": "_doc",
  "_id": "RF64mZEBFMf-66jeR0WD",
  "_version": 1,
  "_score": null,
  "_source": {
    "cloud": {},
    "message": "%xwEx2024-08-28 18:01:15.557 DEBUG 24220 --- [7781-exec-28719] c.b.k.s.s.i.ScorerServiceImpl            : Query from ES took:1.5s",
    "event": {
      "created": "2024-08-28T18:01:15.557Z"
    }
  },
  "fields": {
    "event.created": [
      "2024-08-28T18:01:15.557Z"
    ]
  },
  "highlight": {
    "logger.type": [
      "@opensearch-dashboards-highlighted-field@WLS@/opensearch-dashboards-highlighted-field@"
    ],
    "message": [
      "%xwEx2024-08-28 18:01:15.557 DEBUG 24220 --- [7781-exec-28719] c.b.k.s.s.i.ScorerServiceImpl            : Query from ES took:@opensearch-dashboards-highlighted-field@1.5s@/opensearch-dashboards-highlighted-field@"
    ]
  },
  "sort": [
    1,
    1724868075557
  ]
}

I want to apply a regexp filter to the message field. Here is its mapping:

        "message" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }

Using this DSL filter to regexp match the time part of the message field works:

{
  "query": {
    "regexp": {
      "message": {
        "value": "[0-9]\\.?[0-9]*s"
      }
    }
  }
}

Using this DSL filter to regexp match the whole text part of the message field fails:

{
  "query": {
    "regexp": {
      "message": {
        "value": "Q.*[0-9]\\.?[0-9]*s"
      }
    }
  }
}

This DSL filter also fails:

{
  "query": {
    "regexp": {
      "message.keyword": {
        "value": "Q.*[0-9]\\.?[0-9]*s"
      }
    }
  }
}

The matched message field text value in the above sample:

"%xwEx2024-08-28 18:01:15.557 DEBUG 24220 --- [7781-exec-28719] c.b.k.s.s.i.ScorerServiceImpl            : Query from ES took:1.5s"

The difference in the regexp patterns:

"value": "Q.*[0-9]\\.?[0-9]*s"
"value":    "[0-9]\\.?[0-9]*s"

Please suggest a DSL filter with a regular expression pattern like "Query from ES took:[0-9]\\.?[0-9]*s" that matches text such as Query from ES took:12.553s.

The time value can range from 0 to 999.999.


Solution

  • You are using this mapping for the message field:

    {
      "message": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
    

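    A regexp query is a term-level query. On a text field it runs against the individual tokens produced at index time, not against the original string, and the pattern must match a whole token. Assuming the field uses the default standard analyzer, the message is split into lowercased tokens, and one of them is 1.5s, so this query matches: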

    {
      "query": {
        "regexp": {
          "message": {
            "value": "[0-9]\\.?[0-9]*s"
          }
        }
      }
    }
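
The token-level behaviour can be imitated outside the cluster. The sketch below is only an approximation: it uses Python's re.fullmatch in place of Lucene's anchored regexp engine, and the token list is a hand-written subset of what the standard analyzer is assumed to produce for the sample message (the _analyze API gives the real list).

```python
import re

# Hand-written, assumed subset of the lowercased tokens for the sample message.
ASSUMED_TOKENS = ["query", "from", "es", "took", "1.5s"]


def matching_tokens(pattern, tokens):
    """Return the tokens that the pattern matches in full (anchored match)."""
    compiled = re.compile(pattern)
    return [token for token in tokens if compiled.fullmatch(token)]


# Patterns as the engine sees them, after JSON unescaping of "\\." to "\.".
TIME_ONLY = r"[0-9]\.?[0-9]*s"
WHOLE_TEXT = r"Q.*[0-9]\.?[0-9]*s"

print(matching_tokens(TIME_ONLY, ASSUMED_TOKENS))   # the 1.5s token matches
print(matching_tokens(WHOLE_TEXT, ASSUMED_TOKENS))  # no single token matches
```

No token starts with an uppercase Q and no token spans several words, which is why the longer pattern finds nothing on the analyzed field.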
    

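    The pattern Q.*[0-9]\\.?[0-9]*s fails on message for the same reason: no single token spans from Query to 1.5s, and with the default standard analyzer the tokens are lowercased, so none starts with an uppercase Q. If you run it against message.keyword instead: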

    {
      "query": {
        "regexp": {
          "message.keyword": {
            "value": "Q.*[0-9]\\.?[0-9]*s"
          }
        }
      }
    }
    

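    This searches the keyword sub-field, which is not analyzed and stores the whole message as a single term. A regexp pattern is anchored: it must match the entire term, not just part of it. The message starts with %xwEx, not with Q, so the pattern fails. Note also that with ignore_above set to 256, messages longer than 256 characters are not indexed in message.keyword at all. To match the whole field, allow for the leading text by prefixing the pattern with .*: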

    {
      "query": {
        "regexp": {
          "message.keyword": {
            "value": ".*Q.*[0-9]\\.?[0-9]*s"
          }
        }
      }
    }
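
The anchoring rule can be checked quickly on the sample value. This is a rough sketch that assumes Python's re.fullmatch is a fair stand-in for an anchored Lucene regexp on this input; the two engines are not identical in general.

```python
import re

# The message.keyword term is the whole, unanalyzed message.
MESSAGE = (
    "%xwEx2024-08-28 18:01:15.557 DEBUG 24220 --- [7781-exec-28719] "
    "c.b.k.s.s.i.ScorerServiceImpl            : Query from ES took:1.5s"
)


def matches_whole_term(pattern, term):
    """True when the pattern covers the term from first to last character."""
    return re.fullmatch(pattern, term) is not None


print(matches_whole_term(r"Q.*[0-9]\.?[0-9]*s", MESSAGE))    # False: term starts with %
print(matches_whole_term(r".*Q.*[0-9]\.?[0-9]*s", MESSAGE))  # True: .* absorbs the prefix
```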
    

    If more text can follow the final s character, match the rest of the line as well:

    "value": ".*Q.*[0-9]\\.?[0-9]*s.*"
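
For the literal pattern requested in the question, one possible sketch follows. The {1,3} bounds are an assumption derived from the stated range of 0 to 999.999, the helper name build_took_query is made up for illustration, and Python's re module is used only as an approximate local check, not as the Lucene engine.

```python
import json
import re

# Up to three integer digits, optionally a dot and up to three decimals, then "s".
TOOK_PATTERN = r".*Query from ES took:[0-9]{1,3}(\.[0-9]{1,3})?s"


def build_took_query(field="message.keyword"):
    """Build a regexp query body (hypothetical helper, not a client API)."""
    return {"query": {"regexp": {field: {"value": TOOK_PATTERN}}}}


# json.dumps doubles the backslash, giving the "\\." form used in the DSL.
print(json.dumps(build_took_query(), indent=2))

# Approximate local check of the pattern.
for text in ("... : Query from ES took:1.5s", "Query from ES took:12.553s"):
    print(text, bool(re.fullmatch(TOOK_PATTERN, text)))
```

As with the shorter pattern, append .* if text can follow the final s.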
    

    You can check what the tokens look like by sending a POST request to the _analyze API with this payload:

    {
      "analyzer": "standard",
      "text": "%xwEx2024-08-28 18:01:15.557 DEBUG 24220 --- [7781-exec-28719] c.b.k.s.s.i.ScorerServiceImpl            : Query from ES took:1.5s"
    }
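
A minimal sketch of that request using only the Python standard library. The base URL http://localhost:9200 is an assumption (adjust host, port, and authentication for your cluster), and both helper names are invented for illustration. The request is only built here, not sent.

```python
import json
import urllib.request

SAMPLE = (
    "%xwEx2024-08-28 18:01:15.557 DEBUG 24220 --- [7781-exec-28719] "
    "c.b.k.s.s.i.ScorerServiceImpl            : Query from ES took:1.5s"
)


def build_analyze_request(text, base_url="http://localhost:9200"):
    """Build (but do not send) a POST request for the _analyze API."""
    payload = json.dumps({"analyzer": "standard", "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(response_body):
    """Extract the token strings from a parsed _analyze response."""
    return [entry["token"] for entry in response_body.get("tokens", [])]


request = build_analyze_request(SAMPLE)
print(request.get_method(), request.full_url)
# To send it against a reachable cluster:
#     with urllib.request.urlopen(request) as resp:
#         print(token_texts(json.load(resp)))
```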
    

    The response includes an entry with "token": "1.5s".

    The docs state:

    The standard tokenizer provides grammar based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.

    The "Word Boundary Rules" section of that annex (https://unicode.org/reports/tr29/#Word_Boundary_Rules) says:

    Do not break within sequences, such as “3.2” or “3,456.789”.

    That is why 1.5s survives as a single token, and why your initial pattern for the message field, [0-9]\\.?[0-9]*s, matches it.