regex, grep, pcregrep

pcregrep or grep: searching with lookaheads not working


I am trying to search with a regex that uses a lookahead, but it's not working in pcregrep or grep.

I want to search for sections

  • which may span multiple lines,
  • which start with PQXY at the beginning of a line,
  • which end with OFEJ at the end of a line, and
  • which do not contain either PQXY or OFEJ in between.

Generally I use the following in Sublime Text's find, and it works well:

(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)

Now I want to find the count of such occurrences, so I am trying to use grep or pcregrep, but neither works.

pcregrep -c "(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)" file.txt
zsh: event not found: PQXY|OFEJ).)

And with grep:

$ grep -c -zoP "(?s)(^PQXY(?:(?!PQXY|OFEJTRANS).)*OFEJTRANS\n)" CB_raw_testing_21_feb_CORRECTIONS_0002.txt
zsh: event not found: PQXY|OFEJTRANS).)

How can I do this?

Answer based on @paxdiablo and @anubha.

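The main error was the quoting: the pattern needs single quotes rather than double quotes, as pointed out by @paxdiablo. With that fixed (and -M added so pcregrep can match across lines), the command runs but finds nothing: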

$ pcregrep -c -M '(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)' file.txt 
0

The regex fix is to add (?s), as suggested by @anubha, so that . also matches newlines. Of course, \n also works in place of (\R|\z).

$ pcregrep -c -M '(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)' file.txt
11726

Solution

  • zsh: event not found: PQXY|OFEJ).)

    Since zsh is raising this error, it's almost certainly because it is trying to process the text within the double quotes: an unescaped ! inside double quotes still triggers history expansion. To protect the pattern from that, you should use single quotes, such as:

    pcregrep -c '(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)' file.txt
    

    I don't have pcregrep installed but here's a transcript showing the problem with just echo:

    pax> echo "(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ)"
    zsh: event not found: PQXY|OFEJ).)
    
    pax> echo '(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ)'
    (?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ)
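
One way to keep the quoting problem from recurring is to store the pattern once in a single-quoted variable and pass it as "$pattern". This is only a sketch: the variable name pattern is an illustrative choice, and the pcregrep call is left as a comment because it needs pcregrep and a real file.txt. The double quotes around the variable are safe because the typed command line itself contains no ! for the shell to expand.

```shell
# Single quotes keep the shell from touching !, \ or any other character.
pattern='(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)'

# Print the pattern to confirm it reached the command unchanged.
printf '%s\n' "$pattern"

# The search itself would then look like this (requires pcregrep and file.txt):
# pcregrep -c -M "$pattern" file.txt
```

In zsh, history expansion can also be switched off for the session with setopt no_bang_hist (set +H in bash), though single quotes remain the simpler fix for a one-off command.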
    

    In terms of solving the problem rather than using a specific tool, I would actually opt for awk(a) in this case. You can do something like:

    awk '/^PQXY/     { s = $0; c = 1; next}
         /OFEJ$/     { if (c == 1) { print s""ORS""$0; c = 0 }; next }
         /OFEJ|PQXY/ { c = 0; next }
         c == 1      { s = s""ORS""$0 }' inputFile
    

    This works by using a string (s) to hold the lines collected so far and a flag (c) to track whether a section is being collected. Initially they are an empty string and zero, respectively.

    Then, for each line:

    • If it starts with PQXY, store the line and set the collection flag, then go to next input line.
    • Otherwise, if it ends with OFEJ and you're collecting, output the collected section and stop collecting, then go to next input line.
    • Otherwise, if it has either of the strings in it, stop collecting, move to next input line.
    • Otherwise, if collecting, append current line and move (implicitly) to next input line.
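
Since the question asks for a count rather than the sections themselves, the same steps can be adapted to increment a counter instead of printing. The following is a minimal sketch, not a drop-in replacement: the temporary file and its contents are made-up sample data, and n is a counter name introduced here for illustration.

```shell
# Hypothetical sample input: two valid sections and one that has OFEJ mid-section.
sample=$(mktemp)
printf '%s\n' \
    'PQXY 1' 'abc' '2 OFEJ' \
    'PQXY 2' 'has OFEJ inside' '3 OFEJ' \
    'PQXY 3' '4 OFEJ' > "$sample"

count=$(awk '
    /^PQXY/     { c = 1; next }                   # start (or restart) a candidate section
    /OFEJ$/     { if (c == 1) n++; c = 0; next }  # valid end while collecting: count it
    /OFEJ|PQXY/ { c = 0; next }                   # marker inside a section: discard it
    END         { print n + 0 }                   # n + 0 prints 0 if nothing matched
' "$sample")

echo "$count"
rm -f "$sample"
```

Note that, like the printing version, this treats a line that both starts with PQXY and ends with OFEJ as the start of a new section, so a section that fits entirely on one line is not counted, whereas the original regex would match it.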

    I've tested this with some limited test data and it seems to work okay. Here's the bash script(b) I used for testing; you can add as many test cases as you need to be comfortable that it solves your problem.

    for i in \
        "PQXY 1\nabc\n2 OFEJ\n" \
        "PQXY 1\nabc\n2 OFEJx\n" \
        "PQXY 1\nabc\n  PQXY \n2 OFEJ\n" \
        "PQXY 1\nabc\n  OFEJ \n2 OFEJ\n" \
        "PQXY 1\nabc\ndef\nPQXY 2\n2 OFEJ\n" \
    ; do
        echo "$i:"
        printf "$i" | awk '
            /^PQXY/     { s = $0; c = 1; next}
            /OFEJ$/     { if (c == 1) { print s""ORS""$0; c = 0 }; next }
            /OFEJ|PQXY/ { c = 0; next }
            c == 1      { s = s""ORS""$0 }' | sed 's/^/    /
        '
    done
    

    Here's the output so you can see it in action:

    PQXY 1\nabc\n2 OFEJ\n:
        PQXY 1
        abc
        2 OFEJ
    PQXY 1\nabc\n2 OFEJx\n:
    PQXY 1\nabc\n  PQXY \n2 OFEJ\n:
    PQXY 1\nabc\n  OFEJ \n2 OFEJ\n:
    PQXY 1\nabc\ndef\nPQXY 2\n2 OFEJ\n:
        PQXY 2
        2 OFEJ
    

    (a) In my experience, if you've tried three things with a grep-style regex without success, it's usually faster to move to a more advanced tool :-)


    (b) Yes, I know it's written in bash rather than zsh but that's because:

    • it's a test program to show you that awk works, hence the language used is irrelevant; and
    • I'm far more comfortable with bash than zsh :-)