Search code examples
linuxawksedgrep

Extract a part of a line from a log file


What is the fastest way to extract a substring of interest from input such as the following?

MsgTrace(65/26)noop:user=xxx=INBOX:cmd=534
ImapFetchComplete(56/39)user=xxxxxxxxxx

Desired output (i.e., the :-terminated string following the string MsgTrace(65/26) in this example):

noop

I tried the following, but without success:

egrep -i "[a-zA-Z]+\(.*\)[a-z]+:"

Solution

  • grep by default returns the entire line when a match is found on a given input line.

    While option -o restricts the output to only that part of the line that the regex matched, that is still not enough in this case, because you want a substring of that match.

    However, since you're on Linux, you can use GNU grep's -P option (for support of PCREs, Perl-compatible regular expression), which allows extracting a submatch by way of features such as \K (drop everything matched so far) and (?=...) (a look-ahead assertion that does not contribute to the match):

    $ grep -Po  "[a-zA-Z]\(.*\)\K[a-z]+(?=:)" <<'EOF'
    MsgTrace(65/26)noop:user=xxx=INBOX:cmd=534
    ImapFetchComplete(56/39)user=xxxxxxxxxx
    EOF
    noop  # output
    

    Optional background information:

    Ed Morton points out (in a since-deleted comment) that GNU grep's man page still calls the -P option "highly experimental" that may "warn of unimplemented features", but the option has been around for years, and in practice I have yet to see a warning or a performance problem - YMMV.

    In the case at hand, the above command even outperforms sed and awk solutions - see NeronLeVelu's helpful performance comparison.

    The interesting article Ed points to discusses a potential performance problem that can surface with regex engines such as used by grep -P (via the PCRE library), Perl itself, and many other widely used (and mature) regex engines, such as in Python, Ruby, and PHP:

    • In short: the recursive backtracking algorithm employed by these engines can result in severe performance degradation with "pathological" regexes that string together long sequences of subexpressions with variable-length quantifiers, such as (a longer version of) a?a?a?a?aaaa to match aaaa.
    • The article argues that backtracking is only truly required when a regex contains backreferences, and that a different, much faster algorithm should be employed in their absence.