
How to select lines between two same marker patterns which may occur multiple times with awk/perl or any other command line tool


Using awk, perl, or any other command-line tool, how can I select the lines that occur between two identical marker patterns? There may be multiple sections delimited by the marker. A block starts at the first occurrence of the pattern and ends at the second. Everything after that is ignored until the next occurrence of the pattern, which counts as the start of a new block, and so on.

For example, suppose the file contains:

abc
def1
ghi1
jkl1
abc
1
2
3
abc
def2
ghi2
jkl2
abc
4
5
6
abc
stu
abc

The pattern is abc, so I need this output:

abc
def1
ghi1
jkl1
abc
abc
def2
ghi2
jkl2
abc
abc
stu
abc

I tried various solutions from the other related questions, but they were all for different start and end patterns.

How to print lines between two patterns, inclusive or exclusive (in sed, AWK or Perl)?

Extract lines between two patterns from a file

Extract text between 2 markers

Extract lines between 2 tokens in a text file using bash

I updated the solutions as per my need, which looked something like this:

perl -lne 'if(/abc/){$flag=1; print} elsif(/abc/){$flag=0}' file.txt
awk '/abc/,/abc/' file.txt

I only ended up getting lines which contain the pattern and not the text block between them.

How can I do this in awk, perl, or any other command-line tool so that I get each text block together with its surrounding markers?


Solution

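  • An easy way is the range operator, in its three-dot variant. With three dots the end pattern is not tested on the line where the range starts, so a marker line cannot open and close the range at once: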

    perl -wne'/abc/ ... /abc/ and print' data.txt
    

    Another way keeps an explicit flag, as in the question's attempt, but is more concise:

    perl -wnlE' /abc/ and $f ^= 1; $f and say' data.txt
    

    This doesn't print the end marker, though. To print both the start and the end marker:

    perl -wnlE' ($f or /abc/) and say; /abc/ and $f ^= 1' data.txt
    

    Explanation --

    In Perl all logical operators short-circuit. Consider A and B: if the first expression (A) evaluates to something "falsey" then B is not evaluated -- that code doesn't run. Thus A and B is mostly equivalent to if (A) { B }.

    I use the short-circuiting here to keep the one-liner compact; in regular code it is normally far clearer to write out the if statement. So the first statement amounts to

    • if ($f or /abc/) { say } -- print the line if $f evaluates to "truthy" (flag is set) or we are on the line with abc. The second condition is matched by regex, $_ =~ m/abc/, where m may be omitted with // delimiters and the pattern binds to $_ by default so that can be omitted as well -- thus just /abc/, which returns true/false.

      Now for setting that flag...

    • In the next statement we test /abc/ and, if it matches (per the short-circuiting and), we do $f ^= 1. This second expression uses the bitwise XOR operator ^, as follows.

      When two numbers are bitwise XOR-ed, each pair of their bits is XOR-ed -- the resulting bit is set if one of them is set but not the other, and it's not set otherwise. So 0101 ^ 1100 gives 1001 (the higher four bits are omitted here; they would be needed for testing).

      XOR-ing with 1 therefore flips the lowest bit: 6 ^ 1 produces 7 (0110 ^ 0001 --> 0111) while 7 ^ 1 returns 6. Starting from 0, the flag flips between 0 and 1. We then assign the result back, $f = $f ^ 1, for which there is the shorthand $f ^= 1.

      Thus this sets the flag if it is unset (0-->1) and the other way round, on a line with abc, as needed for the next line.


    Of course, since this reads from a file and line endings aren't touched, one can use print instead of say, and then only the -wne switches are needed.

    I just liked say better here. Also, -lE with say handles the case where this filter is fed strings without linefeeds.
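
Since the question also asks about awk, the same flag logic can be sketched there too. This is an assumed translation of the Perl one-liner that prints both markers, not something given in the original answer; the variable name `inside` is arbitrary and starts out unset, which awk treats as false.

```shell
data=$(mktemp)
printf '%s\n' abc def1 ghi1 jkl1 abc 1 2 3 abc def2 ghi2 jkl2 abc 4 5 6 abc stu abc > "$data"

awk_out=$(awk '
    # Print while inside a block, and always print a marker line.
    inside || /abc/ { print }
    # Then flip the flag on every marker line, ready for the next line.
    /abc/ { inside = !inside }
' "$data")

printf '%s\n' "$awk_out"
rm -f "$data"
```

As in the Perl version, the print rule comes before the rule that flips the flag, so the closing marker is still printed.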