regex pcre freeswitch logfile substring

Simple REGEX to print specific entries from a string


I have a log file which is full of entries like the one below:

2017-07-13 11:23:43.717948 [CRIT] mod_dptools.c:1713 SRC=7479569217;7479569217;768733974848304;7479569217;300067;333;-1

I'm trying to print specific values from the semicolon-separated list after SRC=; the values are always numeric. For example, I want to print the 1st, 3rd and 5th values.

I tried this pattern:

(?=;).+?(?=;).+?.+?(?=;)

It prints the 2nd and the 3rd values. I'm not sure how to print, for example, the 2nd and the 4th without also printing the 3rd.

UPDATE:

Maybe I was not clear enough, or the example was not in its best form, so let me add some more info:

2017-07-13 11:23:43.717948 [CRIT] mod_dptools.c:1713 SRC=123;1234567890;00000000;2222222;7479569217;87654321;300067;333;-1

My expected output is: 123;00000000;7479569217;300067;333;-1

That means the 1st number, then the 3rd, the 5th, the 6th, the 7th, then the 8th.

Ideally, I would be able to change the selection later if needed, for example to print only the 2nd, 3rd, 4th and 5th entries.


Solution

  • If you trust the data in your logfile and you don't need to validate that the values contain only - and digits, then you can just use a negated character class containing ; (this will improve pattern efficiency) and wrap only the values that you want in capture groups.

    Pattern:

    #not captured--vv------------vv
         =([^;]*;)[^;]*;([^;]*;)[^;]*;([^;]*;)([^;]*;)([^;]*;)([^;]*;)(.*)
             $1            $2            $3      $4      $5      $6    $7
    

    Notice that the last capture group ($7) uses a dot instead of a negated character class. A negated class such as [^;]* also matches newline characters, whereas the dot (by default) does not, so the pattern will not run on into the next line. I assume this is an important feature because your logfile will have many lines of data in it. (If not, the final capture group can use a negated character class like the others before it.)

    I am using * as a zero-or-more quantifier, in case the logfile can deliver empty values between the semicolons. If the logfile always contains a number for each value, then + can be used as a quantifier.

    If you need to validate the values, Usagi's pattern is suitable.

    Consolidating my capture groups like this: =([^;]*;)[^;]*;([^;]*;)[^;]*;([^;]*;[^;]*;[^;]*;[^;]*;.*) or =([^;]*;)[^;]*;([^;]*;)[^;]*;((?:[^;]*;){4}.*) reduces the total number of capture groups and improves pattern efficiency and brevity, but makes the pattern slightly harder to update in the future. A more verbose pattern makes changing the capture groups a snap. It is up to you which pattern to select based on Validation, Efficiency, Brevity, and Maintainability.
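
As a rough illustration of the patterns above, here is a minimal sketch that applies both the verbose and the consolidated pattern to the second sample line from the question. It uses Python's re module rather than PCRE, on the assumption that this particular syntax (negated classes, a non-capturing group, and a {4} quantifier) behaves the same in both. The names extract, VERBOSE, CONSOLIDATED and sample are hypothetical and chosen only for this example.

```python
import re

# Verbose pattern from the answer: one capture group per kept value.
VERBOSE = re.compile(
    r"=([^;]*;)[^;]*;([^;]*;)[^;]*;([^;]*;)([^;]*;)([^;]*;)([^;]*;)(.*)"
)
# Consolidated variant: the trailing kept values share a single group.
CONSOLIDATED = re.compile(r"=([^;]*;)[^;]*;([^;]*;)[^;]*;((?:[^;]*;){4}.*)")


def extract(pattern, line):
    """Return the captured values joined together, or None if no match."""
    match = pattern.search(line)
    return "".join(match.groups()) if match else None


# Second sample line from the question (nine values after SRC=).
sample = (
    "2017-07-13 11:23:43.717948 [CRIT] mod_dptools.c:1713 "
    "SRC=123;1234567890;00000000;2222222;7479569217;87654321;300067;333;-1"
)

print(extract(VERBOSE, sample))
print(extract(CONSOLIDATED, sample))
```

Both calls print 123;00000000;7479569217;87654321;300067;333;-1. The groups keep the 1st, 3rd, 5th, 6th, 7th and 8th values plus the trailing remainder, so the 6th value (87654321) is part of the result. To drop a value, remove the parentheses around its [^;]*; segment in the verbose pattern.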