Concatenating lines to a string in Python


I have a fasta file as follows:

>scaf1
AAAAAATGTGTGTGTGTGTGYAA
AAAAACACGTGTGTGTG
>scaf2
ACGTGTGTGTGATGTGGY
AAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK
>scaf3
AAAGTGTGTTGTGAAACACACYAAW

I want to read it into a dictionary in such a way that multiple lines belonging to one sequence go under one key. The output would be:

{'scaf1': 'AAAAAATGTGTGTGTGTGTGYAAAAAAACACGTGTGTGTG', 'scaf2': 'ACGTGTGTGTGATGTGGYAAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK', 'scaf3': 'AAAGTGTGTTGTGAAACACACYAAW'}

The script I have written is:

import sys
from collections import defaultdict

fastaseq = open(sys.argv[1], "r")

def readfasta(fastaseq):
    fasta_dict = {}
    for line in fastaseq:
        if line.startswith('>'):
            header = line.strip('\n')[1:]
            sequence = ''
        else:
            sequence = sequence + line.strip('\n')
        fasta_dict[header] = sequence 
    return fasta_dict

fastadict = readfasta(fastaseq)
print(fastadict)

It works correctly and quickly for a small file like this, but when the file is large (about 1.5 GB) it becomes too slow. The step that takes the time is concatenating the sequence lines. Is there a faster way to concatenate the lines into a single string?


Solution

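  • Concatenating strings with + creates a new string every time, because Python strings are immutable. Each step may copy everything accumulated so far, so building a long sequence line by line is time-consuming.

    Instead, collect the lines in a list and join them once with str.join when the record is complete: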

    import sys
    
    def read_fasta(filename):
        fasta_dict = {}
        chunks = []      # sequence lines of the current record
        header = None
        with open(filename, 'r') as f:
            for line in f:
                if line.startswith('>'): # a new record
                    # save the previous record to the dict
                    if header is not None:
                        fasta_dict[header] = ''.join(chunks)
                        del chunks[:]    # empty the list
    
                    # drop the leading '>' and the trailing newline
                    header = line.strip()[1:]
                else:
                    chunks.append(line.strip())
    
            # save the last record (skipped if the file had no header at all)
            if header is not None:
                fasta_dict[header] = ''.join(chunks)
    
        return fasta_dict
    
    fastadict = read_fasta(sys.argv[1])
    print(fastadict)
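
To check the list-and-join approach without a 1.5 GB file, here is a minimal sketch that runs it on the sample from the question. The helper name read_fasta_lines and the use of io.StringIO as a stand-in for the real file are assumptions made for illustration; the function accepts any iterable of lines, so an open file object should work the same way.

```python
import io

def read_fasta_lines(lines):
    """Parse FASTA records from any iterable of lines (hypothetical helper)."""
    fasta_dict = {}
    header = None
    chunks = []  # sequence lines of the current record
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            # a new record starts: store the previous one with a single join
            if header is not None:
                fasta_dict[header] = ''.join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    # store the last record, if the input contained any header
    if header is not None:
        fasta_dict[header] = ''.join(chunks)
    return fasta_dict

# The sample data from the question, held in memory instead of on disk.
sample = io.StringIO(
    ">scaf1\n"
    "AAAAAATGTGTGTGTGTGTGYAA\n"
    "AAAAACACGTGTGTGTG\n"
    ">scaf2\n"
    "ACGTGTGTGTGATGTGGY\n"
    "AAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK\n"
    ">scaf3\n"
    "AAAGTGTGTTGTGAAACACACYAAW\n"
)

result = read_fasta_lines(sample)
for name, seq in result.items():
    print(name, len(seq))
```

Each record is joined exactly once, so the work per sequence grows with its length rather than with the number of concatenations performed so far.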