regex, macos, grep, text-extraction

Regex to select several lines until two consecutive newlines is not working on Mac


I need to extract blocks of text from a 500 MB document on a Mac. Each block starts at a line beginning with "Query #" and ends at two consecutive carriage returns, and the number of lines in a block varies. For example, the document format is:

Query #1: 020.1-Bni_its1_2019_envio1set1

lines I need to extract


Alignments (the following lines I don't need)

xyz
xyx

Query #2: This and the following lines I need. And so on.

There are always exactly two carriage returns before the word "Alignments". So basically I need all the lines from Query #.: until Alignments.

I tried the following regex, but I only recover the first line.

ggrep -P 'Query #.*?(?:[\r\n]{2}|\Z)'

I have tested the regex with multiple iterations at Regex101, but I have not yet found the answer.

The expected output is:

Query #1.   Text.

Lines I need to extract

Query #2: This and following lines I need.

Lines I need.

Query #....

Solution

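  • grep tests each line separately, so your pattern never sees a line break and can only match within the "Query #" line. With pcregrep, which can match across lines, you can use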

    pcregrep -oM 'Query #.*(?:\R(?!\R{2}).*)*' file.txt > results.txt
    

    Here,

    • o - outputs only the matched text rather than whole lines
    • M - enables matching across lines, so a single match may span several line breaks
    • Query #.*(?:\R(?!\R{2}).*)* matches
      • Query # - literal text
      • .* - the rest of the line
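      • (?:\R(?!\R{2}).*)* - zero or more repetitions of a line break sequence (\R) that is not immediately followed by two more line break sequences ((?!\R{2})), and then the rest of that line. The match therefore stops just before the two blank lines that precede "Alignments".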
