regex, powershell, regex-group

Regex Group, problems with catching IPs


Slightly modified log lines are posted below.

I have a regex that should match three groups in each log line: the time, the IP, and the number of messages that the SMTP server received.

I tried the following regex: (\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2}).*(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})..disconnected\.?\s+(\d+) message\[s\]

The problem is only with the second group, the IP. In the first line the IP is 11.132.8.61, but regexr captures only 1.132.8.6, so some digits are left out. I thought \d{1,3} would match all two or three digits when there is more than one. It does so in the second octet, but not in the first or the last.

[16A4:000C-0780] 01.12.2020 01:00:07   SMTP Server: 11.132.8.61 disconnected. 1 message[s] received
[16A4:000E-07F8] 01.12.2020 01:00:07   SMTP Server: 11.132.8.61 disconnected. 1 message[s] received
[16A4:000E-0780] 01.12.2020 01:00:07   SMTP Server: 11.132.8.61 disconnected. 1 message[s] received
[16A4:000C-0780] 01.12.2020 01:00:07   SMTP Server: 11.132.8.61 disconnected. 1 message[s] received
[16A4:000C-07F8] 01.12.2020 01:00:08   SMTP Server: 11.132.8.61 disconnected. 1 message[s] received
[16A4:000C-0780] 01.12.2020 01:04:51   SMTP Server: 11.132.8.61 disconnected. 1 message[s] received
[16A4:000C-07F8] 01.12.2020 01:30:46   SMTP Server: 11.132.8.61 disconnected. 1 message[s] received
[16A4:000C-0780] 01.12.2020 01:30:46   SMTP Server: 11.132.8.61 disconnected. 1 message[s] received
[16A4:000E-0780] 01.12.2020 01:33:25   SMTP Server: 11.132.8.61 disconnected. 1 message[s] received
[16A4:000C-07F8] 01.12.2020 01:33:25   SMTP Server: 11.132.8.61 disconnected. 1 message[s] received

[12CC:0015-118C] 30.11.2020 05:08:59   SMTP Server: bsicip01.dd.example.com (12.99.81.53) disconnected. 1 message[s] received
[12CC:000B-118C] 30.11.2020 05:08:59   SMTP Server: bsicip01.dd.example.com (12.99.81.53) disconnected. 1 message[s] received
[12CC:000F-0FF0] 30.11.2020 05:08:59   SMTP Server: bsicip01.dd.example.com (12.99.81.53) disconnected. 1 message[s] received
[12CC:000F-120C] 30.11.2020 05:10:05   SMTP Server: bsicip03.dd.example.com (12.99.81.53) disconnected. 1 message[s] received
[12CC:0015-118C] 30.11.2020 05:10:05   SMTP Server: bsicip01.dd.example.com (12.99.81.53) disconnected. 1 message[s] received
[12CC:0014-118C] 30.11.2020 05:10:05   SMTP Server: bsicip01.dd.example.com (12.99.81.53) disconnected. 1 message[s] received
[12CC:000B-120C] 30.11.2020 05:10:05   SMTP Server: bsicip01.dd.example.com (12.99.81.53) disconnected. 1 message[s] received
[12CC:000A-120C] 30.11.2020 05:10:05   SMTP Server: bsicip01.dd.example.com (12.99.81.53) disconnected. 1 message[s] received 

The expected output would be:
match[1] = 01.12.2020 01:00:07
match[2] = 11.132.8.61
match[3] = 1

Solution

  • Change .* to .*? (or, given that you can expect at least one character between the capture groups, .+?) to make the subexpression non-greedy.

    That way, .* doesn't "steal" up to two leading digits from what the following \d{1,3} subexpression matches.

    To give a simple example:

    # !! BROKEN: greedy.
    PS> if (' 123' -match '.*(\d{1,3})') { $Matches[1] }
    
    3 # !! Only the LAST digit matched, because .* matched as much as it
      # !! could while still matching \d{1,3}
    
    # OK: non-greedy.
    PS> if (' 123' -match '.*?(\d{1,3})') { $Matches[1] }
    
    123 # OK - all 3 digits matched, because .*? matched as little as it
        # could while still matching \d{1,3}
    
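    The same effect can be reproduced on a fragment of one of the posted log lines. This is only a sketch: $line, $ip, $greedy and $lazy are names chosen here for illustration, and the pattern keeps just the IP group of the regex rather than the whole expression.

```powershell
# Fragment of a log line from the question; all variable names are illustrative.
$line = 'SMTP Server: 11.132.8.61 disconnected.'
$ip   = '(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'

# Greedy prefix: .* keeps as much as it can, leaving a single digit for the first octet.
$greedy = if ($line -match ('.*' + $ip)) { $Matches[1] }

# Non-greedy prefix: .*? stops as early as it can, so the first octet stays whole.
$lazy = if ($line -match ('.*?' + $ip)) { $Matches[1] }

"greedy: $greedy"   # greedy: 1.132.8.61
"lazy:   $lazy"     # lazy:   11.132.8.61
```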

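    To put it all together (note that I'm using .+?, also in lieu of .. before disconnected, because .. demands exactly two characters there and therefore cuts off the last digit of the IP whenever only a single space separates it from disconnected):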

    '[16A4:000C-0780] 01.12.2020 01:00:07   SMTP Server: 11.132.8.61 disconnected. 1 message[s] received',
    '[12CC:0015-118C] 30.11.2020 05:08:59   SMTP Server: bsicip01.dd.example.com (12.99.81.53) disconnected. 1 message[s] received' |
      ForEach-Object {
        if ($_ -match '(\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2}).+?(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).+?disconnected\.?\s+(\d+) message\[s\]') {
          [pscustomobject] @{
            Count = $Matches[3]
            Timestamp = $Matches[1]
            IP = $Matches[2]
          }
        }
      }
    

    The above yields:

    Count Timestamp           IP
    ----- ---------           --
    1     01.12.2020 01:00:07 11.132.8.61
    1     30.11.2020 05:08:59 12.99.81.53
    
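    The missing last digit reported in the question (1.132.8.6) comes from the .. in front of disconnected rather than from the greedy .*. A small sketch of that, with illustrative variable names and only the relevant part of the pattern:

```powershell
# Log line from the question, shortened at the front; variable names are illustrative.
$line = 'SMTP Server: 11.132.8.61 disconnected. 1 message[s] received'
$ip   = '(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'

# '..' needs exactly two characters before 'disconnected'. Only one space is
# available, so the regex engine gives back the last digit of the IP to fill the gap.
$twoDots = if ($line -match ('.*?' + $ip + '..disconnected')) { $Matches[1] }

# '.+?' is satisfied by the single space, so the IP group keeps all of its digits.
$lazyGap = if ($line -match ('.*?' + $ip + '.+?disconnected')) { $Matches[1] }

"with ..  : $twoDots"   # with ..  : 11.132.8.6
"with .+? : $lazyGap"   # with .+? : 11.132.8.61
```

    For the lines where the IP is wrapped in parentheses, the two characters are the closing parenthesis and the space, which is why those lines did not lose their last digit.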

    Note:

    • In general (it may not be necessary in your case), you could make the regex more robust by placing word-boundary assertions, \b, around subexpressions such as \d{1,3} so that they don't accidentally match inside longer runs of digits. Alternatively, you could explicitly require that a non-digit (\D) precede and follow.
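
    A minimal sketch of the word-boundary idea follows. The input string in $text is made up for illustration and does not come from the logs; $ip, $loose and $strict are likewise names chosen here.

```powershell
# IP pattern without a capture group; variable names are illustrative.
$ip = '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'

# Hypothetical input whose digit runs are too long to be a dotted-quad address.
$text = 'id 1234.5.6.7890 seen'

# Without boundaries the pattern matches in the middle of the longer digit runs.
$loose = if ($text -match $ip) { $Matches[0] } else { 'no match' }

# With \b on both sides the match must start and end at a word boundary.
$strict = if ($text -match ('\b' + $ip + '\b')) { $Matches[0] } else { 'no match' }

"loose:  $loose"    # loose:  234.5.6.789
"strict: $strict"   # strict: no match
```

    An address from the logs such as (12.99.81.53) still matches with \b, because the parentheses are non-word characters and therefore form boundaries.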

    Alternative solution using the -split operator:

    As Lee Daley points out, you could use -split, the string-splitting operator, to split your lines into fields, as a conceptually simpler alternative to regexes:

    '[16A4:000C-0780] 01.12.2020 01:00:07   SMTP Server: 11.132.8.61 disconnected. 1 message[s] received',
    '[12CC:0015-118C] 30.11.2020 05:08:59   SMTP Server: bsicip01.dd.example.com (12.99.81.53) disconnected. 1 message[s] received' |
      ForEach-Object {
        $fields = -split $_
        if ($fields[-4] -eq 'disconnected.') {
          [pscustomobject] @{
            Count     = $fields[-3]
            Timestamp = '{0} {1}' -f $fields[1], $fields[2]
            IP        = $fields[-5].Trim('()')
          }
        }
      }
    

    The above yields the same as the regex-based solution.
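
    To see why the negative indices line up, it can help to list the fields that the unary form of -split produces for one of the posted lines. This is just an inspection sketch; $line and $fields are illustrative names.

```powershell
# One log line from the question; variable names are illustrative.
$line = '[16A4:000C-0780] 01.12.2020 01:00:07   SMTP Server: 11.132.8.61 disconnected. 1 message[s] received'

# Unary -split breaks the string on runs of whitespace and ignores
# leading and trailing whitespace, so the triple space yields no empty fields.
$fields = -split $line

# Print each field with its index counted from the start and from the end.
for ($i = 0; $i -lt $fields.Count; $i++) {
  '{0,2} / {1,3}  {2}' -f $i, ($i - $fields.Count), $fields[$i]
}

# Counting from the end: [-5] is the IP, [-4] is 'disconnected.', [-3] is the count.
```

    Counting from the end is what lets the same indices work for both line shapes, since the optional host name only adds a field in front of the IP.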