shell awk text-processing

Print paragraphs that contain a set of patterns (order of occurrence doesn't matter)


Given a set of patterns A = {a_1, a_2, ..., a_n}, I want to print those paragraphs that contain all those patterns: a_1, a_2, ..., a_n.

  • A paragraph starts with a non-whitespace character and ends with a line that contains only whitespace characters.
  • The order in which the patterns appear in the paragraph doesn't matter.

Let's suppose I have the following file

$ cat main.txt
a

b

c

a
b

b
c

c
a

a
b
c

I want to print all those paragraphs that contain the following patterns: \<a\>, \<b\>, \<c\>. That is, the output should be

$ {some command here}
a
b
c

I've written the following command. However, it treats lines containing only spaces or tabs as part of a paragraph (recall that whitespace-only lines must not be considered part of a paragraph). I also think it could be improved by running awk only once.

$ awk -v RS= '/\<a\>/ {print $0,"\n"}' main.txt |\
  awk -v RS= '/\<b\>/ {print $0,"\n"}' |\
  awk -v RS= '/\<c\>/ {print $0}'

a
b
c

Is there a more efficient way to accomplish this?


Solution

  • With an empty RS, awk splits paragraphs only on truly empty lines, so you first have to prepare your input by emptying every whitespace-only line:

    awk '!NF{$0=""}1' main.txt > input.txt
    

    This way, no whitespace-only line is treated as part of a paragraph, and such whitespace can no longer end up matching one of your patterns. The latter is unlikely (though not impossible), but merged paragraphs are a real risk: without this step, the input "a\n \nb\nc" would be read as a single paragraph that matches all the patterns.


    To test all patterns together, run awk once with the conditions combined for each paragraph. Testing one pattern per awk invocation, as you do now, also works, provided you prepare the input first.

    awk -v RS= '/\<a\>/ && /\<b\>/ && /\<c\>/{print $0,"\n"}' input.txt
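If you would rather avoid the intermediate file, a rough sketch of a single-pass variant follows. It does not use RS= at all: it buffers lines itself and treats any line with no fields (empty or whitespace-only) as the end of a paragraph. The file name sample.txt, the helper emit() and the variable names para and result are made up for illustration. Since \< and \> are GNU awk extensions, the sketch swaps in a bracket-expression approximation of a word boundary, on the assumption that "word" characters are letters, digits and underscore.

```shell
# Sample input (hypothetical file name). The line between "a b" and "b c"
# holds a single space, which RS= alone would not treat as a paragraph break.
printf 'a\n\nb\n\nc\n\na\nb\n \nb\nc\n\nc\na\n\na\nb\nc\n' > sample.txt

result=$(awk '
# emit() is a hypothetical helper: print the buffered paragraph
# only if every pattern matches somewhere in it, then reset the buffer
function emit() {
    if (para ~ /(^|[^A-Za-z0-9_])a([^A-Za-z0-9_]|$)/ &&
        para ~ /(^|[^A-Za-z0-9_])b([^A-Za-z0-9_]|$)/ &&
        para ~ /(^|[^A-Za-z0-9_])c([^A-Za-z0-9_]|$)/)
        printf "%s\n", para
    para = ""
}
!NF { emit(); next }       # empty or whitespace-only line ends the paragraph
{ para = para $0 "\n" }    # any other line is appended to the buffer
END { emit() }             # handle a last paragraph with no trailing blank line
' sample.txt)

printf '%s\n' "$result"
```

On this sample only the last paragraph (a, b, c) should come out; the "a b" and "b c" paragraphs stay separate even though only a space-only line sits between them. The trade-off is that the patterns are written out longhand, so with GNU awk you may prefer to keep \<a\> and friends inside emit().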