Concatenating lines into a string in Python

I have a fasta file as follows:


I want to read it into a dictionary in such a way that multiple lines belonging to one sequence are stored under one key. The output would be:


The script I have written is:

import sys
from collections import defaultdict

fastaseq = open(sys.argv[1], "r")

def readfasta(fastaseq):
    fasta_dict = {}
    for line in fastaseq:
        if line.startswith('>'):
            header = line.strip('\n')[1:]
            sequence = ''
        else:
            sequence = sequence + line.strip('\n')
        fasta_dict[header] = sequence 
    return fasta_dict

fastadict = readfasta(fastaseq)
print fastadict

It works correctly and quickly for a small file like this, but when the file size increases (to about 1.5 Gb), it becomes too slow. The step that takes the time is the concatenation of the sequence lines. Is there a faster way to concatenate the lines into a single string?


  • Concatenating strings with + creates a new string every time, because Python strings are immutable, and that is time-consuming.

    Instead, collect the lines in a list and use str.join to concatenate them once all the strings for a record are ready:

    import sys
    def read_fasta(filename):
        fasta_dict = {}
        l = list()
        header = None
        with open(filename, 'r') as f:
            for line in f:
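                if line.startswith('>'): # a new record
                    # save the previous record to the dict
                    if header:
                        fasta_dict[header] = ''.join(l) 
                        del l[:]    # empty the list
                    header = line.strip().split('>')[1]
                else:
                    # sequence line: collect it for the current record
                    l.append(line.strip())
            # save the last record (skipped if the file had no header)
            if header:
                fasta_dict[header] = ''.join(l) 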
        return fasta_dict
    fastadict = read_fasta(sys.argv[1])
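
The same collect-then-join idea can be tried on in-memory data, without a file on disk. What follows is a minimal sketch rather than the answer's own code: the function name `parse_fasta_lines` and the sample records are made up for illustration, it is written for Python 3, and it assumes that any lines appearing before the first `>` header can be ignored.

```python
import io


def parse_fasta_lines(lines):
    """Build {header: sequence} from an iterable of FASTA lines (hypothetical helper)."""
    fasta_dict = {}
    header = None
    chunks = []  # pieces of the current sequence, joined once per record
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # a new record starts: store the previous one, if any
            if header is not None:
                fasta_dict[header] = ''.join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # sequence line belonging to the current record
            chunks.append(line)
    # store the last record
    if header is not None:
        fasta_dict[header] = ''.join(chunks)
    return fasta_dict


# Made-up sample data, only to show the behaviour.
sample = io.StringIO(">seq1\nACGT\nTTGA\n>seq2\nGGCC\n")
result = parse_fasta_lines(sample)
print(result)
```

Because the function accepts any iterable of lines, an open file object can be passed to it in the same way as the `StringIO` buffer here.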