
Search and import multiple words from txt file on Biopython


I have a FASTA file (saved as a .txt) with information about some proteins. I want to find the string that follows a certain pattern in each header and write it to another text file. The headers look like this:

>gi|1168222|sp|P46098.1|
....(text)...
>gi|74705987|sp|O95264.1|
....(text)...

I want to extract all the accession numbers (acc), for example P46098 from sp|**P46098**.1|, and save them in another file, one per line. The accessions differ throughout the file. What I want is whatever comes after the sp| and before the ., or, if there is no ., before the next |.

Is there any easy way of doing this in Biopython?

Thanks


Solution

  • This answer uses Biopython as far as it goes and a regular expression for the rest: Biopython will give you each record's id, but not the accession number on its own.

    from Bio import SeqIO
    import re
    
    with open('output.txt', 'w') as out_file:  # open for writing
        for record in SeqIO.parse('input.txt', 'fasta'):  # parse as FASTA
            # capture what follows sp| up to the next '.' or '|', whichever comes first
            m = re.search(r'sp\|([^|.]+)', record.id)
            if m:
                out_file.write(m.group(1) + '\n')  # one accession per line
    

    A note for the uninitiated: 'w' overwrites any previously existing file, while 'a' appends to it instead.
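To make the difference between the two modes concrete, here is a minimal sketch. The scratch file path is created with `tempfile` purely for illustration, and `Q12345` is a made-up placeholder value, not one from the question.

```python
import os
import tempfile

# Hypothetical scratch file, so nothing real gets overwritten
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

with open(path, 'w') as f:  # 'w' creates the file (or empties an existing one)
    f.write('P46098\n')
with open(path, 'a') as f:  # 'a' keeps the existing content and adds to the end
    f.write('O95264\n')
with open(path) as f:
    after_append = f.read()  # both lines are present

with open(path, 'w') as f:  # reopening with 'w' discards the earlier lines
    f.write('Q12345\n')
with open(path) as f:
    after_overwrite = f.read()  # only the last write remains
```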

    Also note that running the regular expression over the entire text (without using Biopython to parse out the FASTA ids first) should give the same result, provided sp| only appears in the header lines.
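As a rough sketch of that regex-only route, the following uses just the standard library. The function name `extract_accessions`, the sample sequences, and the third header (one without a dot) are made up for illustration. It assumes accessions only need to be read from lines starting with `>`.

```python
import re

# Accession: the text after "sp|" up to the next "." or "|"
ACC_PATTERN = re.compile(r'sp\|([^|.]+)')

def extract_accessions(fasta_text):
    """Return the accessions found on FASTA header lines (those starting with '>')."""
    accessions = []
    for line in fasta_text.splitlines():
        if line.startswith('>'):  # skip sequence lines
            m = ACC_PATTERN.search(line)
            if m:
                accessions.append(m.group(1))
    return accessions

# Illustrative input: sequences and the last header are invented
sample = """>gi|1168222|sp|P46098.1|
MLLWVQQALL
>gi|74705987|sp|O95264.1|
MLSSVMAPLW
>gi|000|sp|Q12345|NAME
MKT
"""

print('\n'.join(extract_accessions(sample)))
```

Restricting the search to header lines avoids false matches elsewhere in the file. To write the result to disk, the joined string can be passed to a file opened with 'w', as in the Biopython version.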