Search code examples
regexapache-nifi

Apache Nifi Expression Language: find part of content, which matches to regex


The task is to find the part of expression, which matches to selected regex and put this part in to value of new field.

The source file is in csv format:

   id;description
   1234;The latitude is 12324.24
   2345;12324.24 this value
   12234;12324.24 is latitude for 12234
   234;No information about latitude

My CSVReader in UpdateRecords:

enter image description here

The "input_schema" is:

 {
 "type": "record",
 "name": "geo_data",
 "fields": [
    {
      "name": "id",
      "type": [
         "int",
         "null"
       ]
    },
    {
      "name": "description",
        "type": [
         "string",
         "null"
       ]
    }
   ]
  }

CSVRecordSetWriter is:

enter image description here

I need to extract "12324.24"-value from "description" and, using UpdateRecord processor, create new field:

"latitude = 12324.24"

I was looking towards matchesRegex. For example:

 /description[matchesRegex(., '(\d+)(\.)(\d+)')]

But it only checks, if the entire expression matches to regex. For example, when "description" includes only "12324.24", matchesRegex works fine, but the field include not only latitude-value.

One important notice: field value "description" doesn't necessarily contain latitude. So it may include for example: "No information about latitude".

Therefore, substring, substringAfter, substringBefore and similar methods, based on the position (or values) of tokens in text, do not fit for my task.

I tried to use suggested by Endzeit approach:

  replaceRegex(/description, '.*?(\\d+.\\d+).*', '$1')

But it returns the entire value for description fields, which contains matched expression. For example, it returns "12324.24 is latitude" instead of expected: "12324.24"

Or it's better to look forward ExtractText? But problem is that this processor, based on the result of extraction, create only one attribute "latitude" for all records. The possible workaround is to use additionaly SplitText (before extraction) and MergeRecords (after ExtractText). But may be it is somethins overcomplicated?


Solution

  • You are using a "Filter", as described in the "Record Path Guide" inside the documentation.

    Your expression /description[matchesRegex(., '(\d+)(\.)(\d+)')] means:

    • take the field description
    • only if the fields value (referenced by .)
    • matches the provided regex.

    If the field's value does not match the query, the expression does not resolve to any field (and its value) but null instead. Just a side note, you can apply such filter expressions to multiple fields simultaneously, e.g. imagine description was an array holding multiple values, your filter could retain only array items that match.

    Instead of filtering values, you want to make use of a "Standalone Function" to transform your original input. The following expression replaceRegex(/description, '.*?(\d+[.]\d+).*', '$1'):

    • retrieves the value of the field description
    • matches it against the provided regular expression, and
    • replaces all matches with the provided substitution

    The substitution in turn can reference capture groups, either by index ($1) or by name ${groupName}.

    A complete configuration of the UpdateRecord used contains:

    {
      "/latitude": "replaceRegex(/description[matchesRegex(., '.*?(\\d+.\\d+).*')], '^.*?(\\d+.\\d+).*$', '$1')",
      "Record Writer": "d79fc7b7-872c-3fd0-8a2a-cf4ff21bfa2e",
      "Record Reader": "44045758-d88e-3c93-ae9f-2173c7860a26",
      "Replacement Value Strategy": "record-path-value"
    }
    

    I've uploaded a Flow definition to GitHub that provides a working example. It works both for NiFi 2.0 and 1.22.0 (just needs some property name adjustments in the latter). You can find it here.