
Parsing sequences from a FASTA file in Python


I have a text file:

>name_1  
data_1  
>name_2  
data_2  
>name_3  
data_3  
>name_4    
data_4  
>name_5  
data_5  

I want to store the headers (name_1, name_2, ...) in one list and the data (data_1, data_2, ...) in another list in a Python program.

def parse_fasta_file(fasta):
    desc=[]    
    seq=[]    
    seq_strings = fasta.strip().split('>')  
    for s in seq_strings:  
        if len(s):  
            sects = s.split()  
            k = sects[0]  
            v = ''.join(sects[1:])  
    desc.append(k)  
    seq.append(v)    

  for l in sys.stdin:  
  data = open('D:\python\input.txt').read().strip()  
  parse_fasta_file(data)
  print seq   

This is the code I have tried, but I am not able to get the answer.


Solution

  • The most fundamental error is trying to access a variable outside of its scope.

    def function(stuff):
        seq = stuff.upper()   # stands in for any computed value; local to function
    
    function('data')
    print(seq)   # NameError: seq does not exist out here
    

    You cannot access seq outside of function. The usual way to do this is to have function return a value, and capture it in a variable within the caller.

    def function(stuff):
        seq = stuff.upper()   # stands in for any computed value
        return seq            # hand the value back to the caller
    
    s = function('data')
    print(s)
    

    (I have deliberately used different variable names inside the function and outside. Inside function you cannot access s or data, and outside, you cannot access stuff or seq. Incidentally, it would be quite okay, but confusing to a beginner, to use a different variable with the same name seq in the mainline code.)
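    As a small illustration of that last remark, here is a sketch in which the mainline code has its own variable called seq. The function name shout and the sample strings are made up for this example. Assigning to seq inside the function creates a local variable and leaves the outer one untouched.

```python
def shout(stuff):
    seq = stuff.upper()      # local: exists only inside shout
    return seq

seq = 'mainline value'       # separate module-level variable with the same name
s = shout('acgt')

print(s)     # ACGT
print(seq)   # mainline value (unchanged by the call)
```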

    With that out of the way, we can attempt to write a function which returns a list of sequences and a list of descriptions for them.

    def parse_fasta (lines):
        descs = []
        seqs = []
        data = ''
        for line in lines:
            if line.startswith('>'):
                if data:   # have collected a sequence, push to seqs
                    seqs.append(data)
                    data = ''
                descs.append(line[1:].strip())  # Trim '>' and surrounding whitespace
            else:
                data += line.strip()  # Drop line endings and stray whitespace
        # there will be yet one more to push when we run out
        seqs.append(data)
        return descs, seqs
    

    This isn't particularly elegant, but should get you started. A better design would be to return a list of (description, data) tuples where the description and its data are closely coupled together.

    with open('file', 'r') as handle:   # the with block closes the file for you
        descriptions, sequences = parse_fasta(handle.read().splitlines())
    

    The sys.stdin loop in your code never uses its loop variable l, so it does not appear to do anything useful and can be removed.
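    The tuple-based design mentioned earlier could look roughly like the sketch below. The name parse_fasta_records and the sample input are invented for illustration, and it assumes every sequence line belongs to the most recent '>' header. Keeping each description next to its data means the two cannot drift out of step, even when a record has an empty sequence.

```python
def parse_fasta_records(lines):
    """Return a list of (description, sequence) tuples from FASTA lines."""
    records = []
    desc = None      # header of the record being collected, if any
    chunks = []      # sequence lines seen so far for that record
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            if desc is not None:
                # finish the previous record before starting a new one
                records.append((desc, ''.join(chunks)))
            desc = line[1:].strip()
            chunks = []
        elif line and desc is not None:
            chunks.append(line)
    if desc is not None:
        # the last record has no following header to trigger the push
        records.append((desc, ''.join(chunks)))
    return records

example = ">name_1\ndata_1\n>name_2\ndata_2\n".split('\n')
records = parse_fasta_records(example)
for desc, data in records:
    print('%s: %s' % (desc, data))
```

    If you still want two parallel lists, zip(*records) on a non-empty result gives the descriptions and the sequences back as two tuples.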