Tags: regex, split, pcre, multiline, pcregrep

split pcregrep multiline matches


tl;dr: How can I split each multiline match with pcregrep?

long version: I have files where some lines start with a lowercase letter and others start with a digit or a special character. Whenever at least two consecutive lines start with a lowercase letter, I want that block in my output. However, I want each match to be separated from the next one instead of having all matches run together. This is the command I use:

pcregrep -M "([a-z][^\n]*\n){2,}"

So if I give a file like this:

-- Header -- 
info1 
info2 
something 
< not interesting > 
dont need this 
+ new section 
additional 1 
additional 2 

The result given is

info1 
info2
something 
additional 1
additional 2 

Yet, what I want is this:

info1 
info2 
something 

additional 1
additional 2

Is this possible and/or do I have to start using Python (or similar)? Even if it's recommended to use something else from here on, it would still be nice to know if it's possible in the first place.

Thanks!


Solution

  • The following sed command seems to do the trick:

    sed -n '/^[a-z]/N;/^[a-z].*\n[a-z]/{p;:l n;/^[a-z]/{p;bl};a\
    
    }'
    

    Explanation:

    /^[a-z]/{              # if a line starts with a lowercase letter
      N;                     # append the next line to the pattern space, keeping the current one
      /^[a-z].*\n[a-z]/{     # test whether the second line also starts with a lowercase letter
        p;                     # print the two buffered lines
        :l n;                  # define label "l", then read the next line (not auto-printed because of -n)
        /^[a-z]/{              # if the new line still starts with a lowercase letter
          p;                     # print it
          bl                     # jump back to label "l"
        }
                               # the group has ended: append an empty line after it
        a\

      }
    }
    

    Sample run:

    $ echo '-- Header --
    > info1
    > info2
    > something
    > < not interesting >
    > dont need this
    > + new section
    > additional 1
    > additional 2 ' | sed -n '/^[a-z]/N;/^[a-z].*\n[a-z]/{p;:l n;/^[a-z]/{p;bl};a\
    >
    > }'
    info1
    info2
    something
    
    additional 1
    additional 2
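
For comparison, the same grouping can also be sketched in awk. This is an alternative, not part of the answer above: it buffers each run of consecutive lines that start with a lowercase letter and prints the run only if it holds at least two lines, with one empty line between printed groups. The function names `print_groups` and `emit_run` are made up for this illustration, and `LC_ALL=C` is an assumption to keep `[a-z]` from matching uppercase letters in some locales.

```sh
#!/bin/sh
# print_groups [FILE...]: hypothetical helper, reads stdin when no file is given.
# Prints every run of >= 2 consecutive lines starting with a lowercase letter,
# separating the printed runs with a single empty line.
print_groups() {
  LC_ALL=C awk '
    # Print the buffered run if it is long enough, then reset the buffer.
    function emit_run(   i) {
      if (n >= 2) {
        if (printed) print ""          # separator before every group but the first
        for (i = 1; i <= n; i++) print run[i]
        printed = 1
      }
      n = 0
    }
    /^[a-z]/ { run[++n] = $0; next }   # extend the current run
    { emit_run() }                     # any other line ends the run
    END { emit_run() }                 # the file may end in the middle of a run
  ' "$@"
}
```

It would be called as `print_groups file.txt`. Unlike the sed version, the separator is printed before each later group rather than after each group, so no trailing empty line is produced even when a non-matching line follows the last group.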