regex linux grep gnu pcre

grep - RegEx multiple-criteria select


Given a file containing this string:

IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@IT1*1*CS*VN*ABC@SAC*X*500@REF*ZZ*BAR@IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@

The goal is to extract the following:

IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@

With the criteria being:

  1. The IT1 "line" must contain *EA*
  2. The REF line must contain BAR

Some notes for consideration:

  • "@" can be thought of as a line break
  • A "group" of lines contains lines starting with IT1 and ending with REF
  • I am running GNU grep 3.7.

The goal is to select the "group" of lines meeting the criteria.

I tried the following:

grep -oP "IT1[^@]*EA[^@]*@.*REF[^@]*BAR[^@]*@" file.txt

But the match starts at the first IT1 in the file and runs through to the last REF containing BAR, instead of returning only the group I want.

I also tried lookarounds:

grep -oP "(?<=IT1[^@]*EA[^@]*@).*?(?=REF[^@]*BAR[^@]*@)" file.txt

But my version of grep returns:

grep: lookbehind assertion is not fixed length


Solution

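  • The problem is that the greedy .* matches everything from the first IT1 containing EA through to the last REF containing BAR, crossing group boundaries along the way. To keep the match from running past the next IT1, replace .* with a tempered greedy token, (?:(?!@IT1).)*, which consumes one character at a time only while the text ahead is not @IT1: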

    IT1[^@]*EA[^@]*@(?:(?!@IT1).)*REF[^@]*BAR[^@]*@
    

    A match can then only run from an IT1 to the REF within the same group.

    Regex demo on regex101
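
To try the pattern from the command line, here is a minimal sketch. It assumes GNU grep built with PCRE support (-P); the file name sample.txt and the variable result are illustrative, not from the question. The pattern is wrapped in single quotes rather than double quotes because an interactive bash session may otherwise treat the ! in (?!@IT1) as history expansion.

```shell
# Recreate the sample input from the question (sample.txt is an illustrative name).
printf '%s\n' 'IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@IT1*1*CS*VN*ABC@SAC*X*500@REF*ZZ*BAR@IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*OK@' > sample.txt

# -o prints only the matched text, -P enables Perl-compatible regex syntax.
# The tempered token (?:(?!@IT1).)* stops the match before the next group starts.
result=$(grep -oP 'IT1[^@]*EA[^@]*@(?:(?!@IT1).)*REF[^@]*BAR[^@]*@' sample.txt)

printf '%s\n' "$result"
# prints: IT1*1*EA*VN*ABC@SAC*X*500@REF*ZZ*BAR@
```

If EA could also appear inside a longer field, matching the delimiters as well (\*EA\* in place of EA) is one way to stay closer to the stated criterion.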