The overall goal here is to remove a block of text starting with a particular string and ending with a positive lookahead. From the testing I've done, it seems that newlines are causing the problem, but I'm not sure what exactly is going on or the best way to fix it.
More context: I want to remove taxa from a .fasta file, including the taxon name and header information and the associated sequence. (fasta format begins with a header >locusname-locusnumber-species_name |locusname-locusnumber \n). Missing data in the sequence is coded as "-". Eventually I would like to do this for several species_names and do so for each of several thousand files in a directory.
I presumed this would be a simple task to do as a perl one-liner in bash (Ubuntu 18.04.2). As an example, from the excerpt below I would like to remove the entire sequence of Pseudomymrex seminole D1367, i.e. the string that starts with >uce-483_Pseudomyrmex_seminole_D1367 |uce-483 and ends with the newline before >uce-483_Pseudomyrmex_seminole_D1435. . ..
For this, I have: perl -pe 's/>(.)+(Pseudomyrmex_seminole_D1367)[\s\S]+(?=>)//' infile.fasta > outfile.fasta
or equivalently perl -pe 's/>(.)+(Pseudomyrmex_seminole_D1367(.)+(?=>)//s' infile.fasta > outfile.fasta
Both of these seem to have no effect at all (i.e. diff infile.fasta outfile.fasta
is empty.) If I remove the positive lookahead, it works correctly but only up to the first newline.
Here's an excerpt from the .fasta for context and testing:
>uce-483_Pseudomyrmex_seminole_D1366 |uce-483
------------------------------------------------------------
---------------------------------------------------tgtaaacgt
tataatacatgcgtatgaaaaaaaaaagtgaacacccggtacgtacccgtgctgaaacgt
tcagatttacatccatttgtagtagcattttcgctagttttttcaagagcaaaaaggaca
cattcaaaactgaatatacatgtcacagatgtttgtttgtgtgcaggtacctgtaatttt
gcaaacatatacctatatatgtgtgtcgcatatatatcatgtagtagatttccatgttat
gcaacatcttctcacaatgacaatcggtcgtttccttcactccgaaatgttcatgcgaac
agttaatctatatcccaagcagcgatgtaatgttatgcggcgcgcaagtctcattagact
tgtaaaccgtccgagtttcgacttaccata----tgtgtgtgtgtgcgcgcgtatgtgca
cgtac------acacgtttgtttatacatttgtctatacatttgcgtgtgaacgcgggat
gaacagagatttgcgcacacatagacatgagaaacgtcacttgtcgatgtagatactaat
tgtggaaaatacatattcctcttcagatacacgggaatgttgaattattttcactcgctc
cacgcgcgagtgttcgctccttttacgcacaacgagtccttctgctgcagc--gagatag
aaaatatttttgcgcggtaatcgtaaacgtatgagtgcctttcgacgtgaattctcttat
ggcagttctcacggtgtaaattataatcgaattaacattgcgagtgtgatctcaatataa
ttatagcgtctaagaacaaacacgtaacatgcacacacacacacacacac----------
---
>uce-483_Pseudomyrmex_seminole_D1367 |uce-483
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
--ttcaaaactgaatatacatgtcacagatgtttgtttgtgtgcaggtacctgtaatttt
gcaaacatatg---atatatatgtgtcgcatatatatcatgtagtagatttccatgttat
gcaacatcttctcacaatgacaatcggtcgtttccttcactctgaaatgttcatgcgaac
agttaatctatatcccaagcagcgatgtaatgttatgcggcgcgcaagtctcattagact
tgtaaaccgtccgagtttcgacttaccata--tgtgtgtgtgtgtgtgcgcgtatgtgca
cgtacgcgcgcacacgtttgtttatacatttgtctatacatttgcgtgtgaacgcgggat
gaacagagatttgcgcacacatagacatgagaaacgtcacttgtcgatg-----------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---
>uce-483_Pseudomyrmex_seminole_D1435 |uce-483
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-------tacatccatttgtagtagcattttcgctagttttttcaagagcaaaaaggaca
cattcaaaactgaatatacatgtcacagatgtttgtttgtgtgcaggtacctgtaatttt
gcaaacatatacctatatatgtgtgtcgcatatatatcatgtagtagatttccatgttat
gcaacatcttctcacaatgacaatcggtcgtttccttcactccgaaatgttcatgcgaac
agttaatctatatcccaagcagcgatgtaatgttatgcggcgcgcaagtctcattagact
tgtaaaccgtccgagtttcgacttaccata--tgtgtgtgtgtgtgtgcgcgtatgtgca
cgtac------acacgtttgtttatacatttgtctatacatttgcgtgtgaacgcgggat
gaacagagatttgcgcacacatagacatgagaaacgtcacttgtcgatgtagatactaat
tgtggaaaatacatattcctcttcagatacacgggaa-----------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---
With -p
(or -n
) the one-liner is reading a line at a time; so it just can't match multiline patterns. One solution is to "slurp" the whole file in, if it isn't too large (see end for line-by-line solution)
perl -0777 -pe'...' in > out
See Command Switches in perlrun.
Then, the code shown in the question has an unbalanced parenthesis and it doesn't compile. Further, there is no reason to capture those .
s so drop the parentheses around. Next, the pattern
s/>.+Pseudomyrmex_seminole_D1367...//;
matches everything from the very first >
to the name of interest, so all preceding sequences are matched and removed as well. Instead, match >[^>]+...D1367
for example, so everything that isn't >
after a >
, to that phrase.
Finally, the last .+(?=>)
will match everything to the very last >
and thus the regex will remove all following sequences, not what you want according to the description. Instead, limit it to match to the first following >
, either by making it "non-greedy" with .+?(?=>)
or, more simply, with [^>]+
.
All corrected
perl -0777 -pe's/>[^>]+?Pseudomyrmex_seminole_D1367[^>]+//' in > out
Note that there is no need for /s
modifier now, since its purpose is to make .
match a newline and here we don't need that since the [^>]
does match newlines as well (anything other than >
). The quantifier is +?
to (hopefully) prevent backtracking each whole sequence that doesn't match.
Or, with your original use of lookahead
perl -0777 -pe's/>[^>]+?Pseudomyrmex_seminole_D1367.+?(?=>)//s' in > out
These work as expected with your sample, as well as with an extended example I made up with further sequences (>...
) added.
For reference, and since a fasta file can be too big to slurp into a string, here it is line by line.
Once you see the >...
line of interest set a flag; print a line if that flag isn't set (and if we aren't on that very line). Once you reach the next >
clear the flag (print that line, too).
perl -ne'
if (/^>.+?Pseudomyrmex_seminole_D1367/) { $f = 1 }
elsif (not $f) { print }
elsif (/^>/) { $f = 0; print }
' in > out
I suspect that this may also perform considerably better on very large files.
The regex in the first solution has to scan each sequence whole in order to find that it is not the one of interest; it is only once it hits the next >
that it can decide that the sequence doesn't match (and with no backtracking, hopefully, since +?
would've stopped it had the right phrase been encountered).
Here the code mostly checks the first character and a flag.
So it's an incomparably lesser workload here -- but here the regex engine is started up on every line, and that is expensive. I can't tell with confidence how they stack against each other without trying.