Search code examples
pythonfastablast

Processing files line-by-line


I am processing a large BLAST file together with a large FASTA file, and I need to load several lines of FASTA for one block of BLAST (let's say it is one line).

I would expect that in the second loop (line) of BLAST, it will continue on the next line from the last-processed FASTA line, but it is loading all the same FASTA lines. Why? And how can I load the next line? Is it really necessary to add some indexing?

with open(fastaname,'r') as fastafile:
  with open(blastfilename,'r') as blastfile:
    for line in blastfile:
      while True:  
        fastaline = fastafile.readline()[:-1]  
            if fastaline[0]=='>':  
                break  
            fastaseq += fastaline
      somefunction(line,fastaseq)

FASTA has the typical format:

>name_of_seqence\n  
ACGATCATCGTAGCTGCATGACTGCA\n  
GATCGATCTGATCGATGCAGTCAGTA\n
>name_of_seqence\n  
GCACGCGACCACGATCATTGACTGCA\n  
CAAAAGATCTGATCGATGCAGTCAGT\n
CAGTCGATGCTAGTCGATGCTCGATA\n
etc.

I need each sequence for every line of the next BLAST sequence.


Solution

  • I don't see anything that would ever clear the lines you're appended to fastaseq after finishing a line from your blast file. Try this:

    with open(fastaname,'r') as fastafile:
        with open(blastfilename,'r') as blastfile:
            for line in blastfile:
                fastaseq = '' # or whatever data type this is 
                while True:  
                    fastaline = fastafile.readline()[:-1]  
                    if fastaline[0]=='>':  
                        break  
                fastaseq += fastaline
            somefunction(line,fastaseq)
    

    As it is your first blast line would get the first set of fasta lines, the second blast line would get the first and second sets, etc etc.