Search for and extract multiple strings from a txt file with Biopython

I have a FASTA file (saved as a .txt) with information about some proteins, and I want to find the string that follows a given pattern in each header and write it to another txt file. The headers look like this:

>gi|1168222|sp|P46098.1|
....(text)...
>gi|74705987|sp|O95264.1|
....(text)...

I want to get all the accession numbers (acc), e.g. sp|**P46098**.1|, and save them to another file in a single column. The accession numbers differ throughout the file; what I want is whatever comes after the sp| and before the '.', or, if there is no '.', before the next |.

Is there any easy way of doing this in Biopython?

Thanks

Solution

This answer uses Biopython as far as it goes, then a regular expression for the rest (Biopython will give you each record's id, but not the accession number on its own):

from Bio import SeqIO
import re

with open('output.txt', 'w') as outFile: # open for writing
    for record in SeqIO.parse('input.txt', 'fasta'): # parse as FASTA
        # capture what follows sp| up to the first '.' or '|'
        m = re.search(r'sp\|([^|.]+)', record.id)
        if m:
            outFile.write(m.group(1) + '\n') # one accession per line

Just as a note to the uninitiated: 'w' overwrites any previously existing file, while 'a' appends to it instead.

Also note that applying the regular expression directly to the entire text (without using Biopython to parse out the FASTA ids first) should give the same result, provided 'sp|' only ever appears in the header lines.
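As a rough sketch of that regex-only route, assuming the headers look like the ones in the question: the helper name extract_accessions and the sample sequence lines below are made up for illustration, and the pattern is anchored to lines starting with '>' so that sequence lines are never matched.

```python
import re

# Assumed header shape: >...|sp|ACCESSION.VERSION|...
# Capture what follows "sp|" up to the first '.' or '|', on header lines only.
ACC_PATTERN = re.compile(r'^>.*?sp\|([^|.]+)', re.MULTILINE)

def extract_accessions(text):
    """Return the accession numbers found in FASTA header lines, in file order."""
    return ACC_PATTERN.findall(text)

# Placeholder data: the headers come from the question, the sequences are invented.
sample = """>gi|1168222|sp|P46098.1|
MLLWVQQALL
>gi|74705987|sp|O95264.1|
MLSSVMAPLW
"""

accessions = extract_accessions(sample)
print("\n".join(accessions))  # one accession per line
```

To use it on a real file, read the whole file into a string with open('input.txt').read() and write "\n".join(accessions) to the output file. The Biopython version is still the safer choice if the file may contain anything other than plain FASTA headers and sequences.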