Search code examples
python-3.xsortingbioinformaticsfasta

sort fasta by sequence size


I currently want to sort a hudge fasta file (+10**8 lines and sequences) by sequence size. fasta is a clear defined format in biology use to store sequence (genetic or proteic):

>id1

sequence 1 # could be on several line

>id2

sequence 2

...

I have run a tools that give me in tsv format:

the Identifiant, the length, and the position in bytes of the identifiant.

for now what I am doing is to sort this file by the length column then I parse this file and use seek to retrieve the corresponding sequence then append it to a new file.

# this fonction will get the sequence using seek
def get_seq(file, bites):  

    with open(file) as f_:
        f_.seek(bites, 0) # go to the line of interest
        line = f_.readline().strip() # this line is the begin of the 
                                     #sequence
        to_return = "" # init the string which will contains the sequence

        while not line.startswith('>') or not line:  # while we do not 
                                                     # encounter another identifiant
        to_return += line
        line = f_.readline().strip()

    return to_return
# simply append to a file the id and the sequence
def write_seq(out_file, id_, sequence):

    with open(out_file, 'a') as out_file:
        out_file.write('>{}\n{}\n'.format(id_.strip(), sequence))

# main loop will parse the index file and call the function defined below
with open(args.fai) as ref:

    indice = 0

    for line in ref:

        spt = line.split()
        id_ = spt[0]
        seq = get_seq(args.i, int(spt[2]))
        write_seq(out_file=args.out, id_=id_, sequence=seq)

my problems is the following is really slow does it is normal (it takes several days)? Do I have another way to do it? I am a not a pure informaticien so I may miss some point but I was believing to index files and use seek was the fatest way to achive this am I wrong?


Solution

  • Seems like opening two files for each sequence is probably contibuting to a lot to the run time. You could pass file handles to your get/write functions rather than file names, but I would suggest using an established fasta parser/indexer like biopython or samtools. Here's an (untested) solution with samtools:

    subprocess.call(["samtools", "faidx", args.i])
    with open(args.fai) as ref:
    
        for line in ref:
    
            spt = line.split()
            id_ = spt[0]
            subprocess.call(["samtools", "faidx", args.i, id_, ">>", args.out], shell=True)