python, bioinformatics, fasta

Creating lists of organism names and DNA sequence data from a multi-FASTA file


I am working with DNA sequence data in the FASTA format and need to create two lists, one containing the organisms' names and one containing their sequences. I came across the post "Add multiple sequences from a FASTA file to a list in python", but the solution doesn't work properly for me (and I cannot comment yet).

A FASTA file is a plain-text file in the following format: one line starting with ">" gives the organism's name, followed by one or more lines of sequence data. A FASTA file can contain multiple organisms, each organised in its own block:

>Organism1
ACTGATGACTGATCGTACGT
ATCGATCGTAGCTACGATCG
ATCATGCTATTGTG
>Organism2
TACTGTAGCTAGTCGTAGCT
ATGACGATCGTACGTCGTAC
TAGCTGACTG
...

The code I wrote with the help of the link above is:

data_file = open("multitest.fas","r")
data_tmp = []
a=[] #list for organisms name
b=[] #list for sequence data
for line in data_file:
    line = line.rstrip() 
    line = line.strip("\n").strip("\r") 
    for i in line:
        if line[0] == ">":
            a.append(line[1:])
            if data_tmp:
                b.append("".join(data_tmp))
                data_tmp=[]
            break
        else:
            line=line.upper()
    if all([k==k.upper() for k in line]):
        data_tmp.append(line)
print a
print b

The code works fine, except that the sequence of the last organism is not appended to the list b. This seems obvious, as the sequence data is only added when a ">" is encountered. How can I make sure that the last sequence is also added? And why did nobody else have the same problem with the code in the linked post? Thanks for any advice!


Solution

  • I did it with a regex. Hope you find it helpful.

    >>> import re
    >>> data_file = open("multitest.fas","r")
    >>> data=data_file.read()
    >>> org=re.findall(r'>(\w*)',data) 
    >>> org
    ['Organism1', 'Organism2']
    >>> # split on each header line; the piece before the first '>' is empty, hence [1:]
    >>> seq=[i.replace('\n','') for i in re.split(r'>\w*',data)[1:]]
    >>> seq
    ['ACTGATGACTGATCGTACGTATCGATCGTAGCTACGATCGATCATGCTATTGTG', 'TACTGTAGCTAGTCGTAGCTATGACGATCGTACGTCGTACTAGCTGACTG']
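
The question also asks how to keep the loop-based approach and still capture the last sequence. Since no ">" follows the final block, one option is to flush the collected lines once more after the loop ends. The following is a minimal sketch of that idea, not the code from the linked post; the function name read_fasta and the variable names are made up for illustration, and the example data is the sample from the question.

```python
def read_fasta(lines):
    """Return (names, sequences) from an iterable of FASTA lines."""
    names = []
    sequences = []
    chunks = []  # sequence lines of the record currently being read
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if names:
                sequences.append("".join(chunks))
            chunks = []
            names.append(line[1:])
        else:
            chunks.append(line.upper())
    # No header follows the last record, so flush it after the loop.
    if names:
        sequences.append("".join(chunks))
    return names, sequences


# Sample data copied from the question.
example = """>Organism1
ACTGATGACTGATCGTACGT
ATCGATCGTAGCTACGATCG
ATCATGCTATTGTG
>Organism2
TACTGTAGCTAGTCGTAGCT
ATGACGATCGTACGTCGTAC
TAGCTGACTG
"""
names, sequences = read_fasta(example.splitlines())
```

An open file object can be passed in place of the list of lines, for example read_fasta(open("multitest.fas")). In the question's own code, the equivalent change would be to add `if data_tmp: b.append("".join(data_tmp))` after the for loop.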