tl;dr: How can I split each multiline match with pcregrep?
long version: I have files where some lines start with a lowercase letter and others start with a number or a special character. Whenever at least two consecutive lines start with a lowercase letter, I want them in my output. However, I want each match to be separated from the next one instead of all matches being run together. This is the regex:
pcregrep -M "([a-z][^\n]*\n){2,}"
So if I give a file like this:
-- Header --
info1
info2
something
< not interesting >
dont need this
+ new section
additional 1
additional 2
The result given is:
info1
info2
something
additional 1
additional 2
Yet, what I want is this:
info1
info2
something

additional 1
additional 2
Is this possible and/or do I have to start using Python (or similar)? Even if it's recommended to use something else from here on, it would still be nice to know if it's possible in the first place.
Thanks!
The following sed command seems to do the trick:
sed -n '/^[a-z]/N;/^[a-z].*\n[a-z]/{p;:l n;/^[a-z]/{p;bl};a\

}'
Explanation:
/^[a-z]/{                # if a line starts with a lowercase letter
  N;                     # append the next line to the pattern space, keeping the first one
  /^[a-z].*\n[a-z]/{     # test whether the second line also starts with a lowercase letter
    p;                   # print the two lines held in the pattern space
    :l n;                # define a label "l", then read the next line
    /^[a-z]/{            # if the new line still starts with a lowercase letter
      p;                 # print it
      bl                 # jump back to label "l"
    }
    a\

                         # the empty line above is the appended text: a blank line after every matching group
  }
}
$ echo '-- Header --
> info1
> info2
> something
> < not interesting >
> dont need this
> + new section
> additional 1
> additional 2 ' | sed -n '/^[a-z]/N;/^[a-z].*\n[a-z]/{p;:l n;/^[a-z]/{p;bl};a\
>
> }'
info1
info2
something

additional 1
additional 2
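If the sed one-liner feels too cryptic, the same idea might be easier to follow in awk. This is only a rough sketch of a different technique, not a translation of the sed script: it buffers each run of consecutive lines that start with a lowercase letter and prints the run only if it holds at least two lines, with a blank line between runs. The function name `split_runs` is made up for illustration, and `LC_ALL=C` is an assumption to keep `[a-z]` limited to ASCII lowercase.

```shell
# Hypothetical helper: print runs of 2+ consecutive lines that start with
# a lowercase letter, separating the runs with a blank line.
split_runs() {
  LC_ALL=C awk '
    # Print the buffered run if it has at least two lines, then reset.
    function emit_run() {
      if (n >= 2) {
        if (printed) print ""      # blank line between runs, not before the first
        printf "%s", buf
        printed = 1
      }
      buf = ""; n = 0
    }
    /^[a-z]/ { buf = buf $0 "\n"; n++; next }   # extend the current run
    { emit_run() }                              # any other line ends the run
    END { emit_run() }                          # flush a run that reaches EOF
  ' "$@"
}
```

It can be used either as `split_runs file.txt` or at the end of a pipe, e.g. `cat file.txt | split_runs`. Unlike pcregrep, it never prints a lone lowercase line such as "dont need this", because a run of one is dropped.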