I have a text file with inconsistent formatting, but the relevant sections look like:
CDS complement(99074..99808)
/note="important in cell to cell spread of the virus, a
tegument protein"
/codon_start=1
As part of an existing bash pipeline, I need to remove every /note="anything" entry, including ones that span multiple lines, to get:
CDS complement(99074..99808)
/codon_start=1
I've tried several ways to invert the match (as grep -v would), but the closest one only works when the match does not span multiple lines:
perl -ne '/\/\bnote\b\="[^"]+"/||print' file.txt
I can match the strings I wish to remove by checking with the following perl one-liner, but so far I cannot combine the two methods to invert the match and remove the strings that span multiple lines:
perl -0777 -ne 'print "$1\n" while ( /(\s+\/\bnote\b\="[^"]+")/sg )' file.txt
Running the first one-liner with -0777 produces no output.
The simple approach involves reading the entire stream into memory. This is done by telling Perl to treat the whole file as a single line using -0777 or the newer -g (available since Perl 5.36).
perl -0777pe's{^\s*/note="[^"]*"\n}{}mg'
Doing it a line at a time is more complicated since it requires a flag to indicate whether we're in the string or not.
perl -ne'
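$f ||= m{^\s*/note="};   # set the flag when a note starts on this line
print if !$f;            # print only lines that are outside a note
$f &&= !m{"$};           # clear the flag once a line ends with the closing quote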
'