Regex in Streamsets


Hi, I want to parse a log file using Streamsets. The log looks like this:

Deny tcp src dmz:77.77.77.7/61112 dst dmz:55.55.56.57/139 by access-group "outside_access_in" [0x8b3ecfdc, 0x0]

There may be more than 2 IPs in the log, and I'm trying to capture only the 1st and 2nd IP addresses. The documentation says that Streamsets uses Java regex patterns.

What I have so far in the Expression Evaluator processor in Streamsets is:

${str:regExCapture(record:value('/Message'),'(\\d+[.]\\d+[.]\\d+[.]\\d+/?\\d*)', 1)}

Any idea how to capture the 2nd IP?


Solution

  • You may use

    ${str:regExCapture(record:value('/Message'),'^(?:.*?(\\d+(?:[.]\\d+){3}(?:/\\d+)?)){2}', 1)}
    

    See the regex demo.

    Details

    • ^ - start of string
    • (?:.*?(\\d+(?:[.]\\d+){3}(?:/\\d+)?)){2} - two consecutive occurrences of
      • .*? - any 0+ chars other than line break chars, as few as possible
      • (\\d+(?:[.]\\d+){3}(?:/\\d+)?) - Capturing group 1 (its value will be returned by str:regExCapture since the last argument is set to 1):
        • \\d+ - 1+ digits
        • (?:[.]\\d+){3} - three occurrences of . and 1+ digits
        • (?:/\\d+)? - an optional sequence of / and 1+ digits.

    Since the contents of a group are overwritten each time the group matches again within one match operation, Group 1 will only contain the second IP value.
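The overwriting behaviour described above can be checked in plain Java. This is only a sketch: it assumes that str:regExCapture behaves like java.util.regex.Matcher.find() followed by group(n), and the class name SecondIpDemo and the helper secondIp are made up for illustration. The three-address log line is also invented to show that only the second address is returned.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SecondIpDemo {
    // Same regex as in the expression above; the doubled backslashes here are Java string escapes.
    static final Pattern SECOND_IP =
        Pattern.compile("^(?:.*?(\\d+(?:[.]\\d+){3}(?:/\\d+)?)){2}");

    // Hypothetical helper: returns the text of Group 1 after the second repetition, or null if no match.
    static String secondIp(String message) {
        Matcher m = SECOND_IP.matcher(message);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String log = "Deny tcp src dmz:77.77.77.7/61112 dst dmz:55.55.56.57/139 "
                   + "by access-group \"outside_access_in\" [0x8b3ecfdc, 0x0]";
        System.out.println(secondIp(log));   // 55.55.56.57/139

        // Invented line with three addresses: the {2} quantifier stops after the second one.
        String three = "Deny tcp src dmz:1.1.1.1/10 dst dmz:2.2.2.2/20 via 3.3.3.3";
        System.out.println(secondIp(three)); // 2.2.2.2/20

        // Fewer than two addresses: the pattern cannot match at all.
        System.out.println(secondIp("src dmz:1.1.1.1/10 only")); // null
    }
}
```

Changing {2} to {1} in the same pattern would leave the first address in Group 1 instead.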

    Note that a safer, more precise IP pattern would be (?:25[0-5]|2[0-4]\\d|[0-1]?\\d?\\d)(?:\\.(?:25[0-5]|2[0-4]\\d|[0-1]?\\d?\\d)){3}, which limits each octet to 0-255 (see Extract ip addresses from Strings using regex). With it, the command becomes

     ${str:regExCapture(record:value('/Message'),'^(?:.*?\\b((?:25[0-5]|2[0-4]\\d|[0-1]?\\d?\\d)(?:\\.(?:25[0-5]|2[0-4]\\d|[0-1]?\\d?\\d)){3}(?:/\\d+)?)){2}', 1)}
    

    See another regex demo.
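To see where the two patterns differ, here is a small Java comparison. It is a sketch under the same assumption as before (str:regExCapture acting like Matcher.find() plus group(n)); the class name StrictIpDemo, the helper capture, and the sample line with the out-of-range octet 300 are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StrictIpDemo {
    // Loose pattern: any run of digits counts as an octet.
    static final Pattern LOOSE =
        Pattern.compile("^(?:.*?(\\d+(?:[.]\\d+){3}(?:/\\d+)?)){2}");

    // Strict pattern from the answer: each octet must be 0-255 and start at a word boundary.
    static final Pattern STRICT = Pattern.compile(
        "^(?:.*?\\b((?:25[0-5]|2[0-4]\\d|[0-1]?\\d?\\d)"
        + "(?:\\.(?:25[0-5]|2[0-4]\\d|[0-1]?\\d?\\d)){3}(?:/\\d+)?)){2}");

    // Hypothetical helper: Group 1 of the first match, or null when nothing matches.
    static String capture(Pattern p, String message) {
        Matcher m = p.matcher(message);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Invented line: 300.1.2.3 is not a valid IPv4 address.
        String log = "Deny tcp src dmz:300.1.2.3/80 dst dmz:10.0.0.5/443 via 192.168.1.1";

        // The loose pattern counts 300.1.2.3 as the first address.
        System.out.println(capture(LOOSE, log));  // 10.0.0.5/443

        // The strict pattern skips it, so the "second" address shifts by one.
        System.out.println(capture(STRICT, log)); // 192.168.1.1
    }
}
```

Which behaviour is preferable depends on whether malformed addresses can actually appear in the logs being parsed.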