Tags: python, bioinformatics, biopython, fasta

Write parsed fasta file back to fasta format from a dictionary


I wrote a function that parses a FASTA file because I needed to remove some odd characters. Now I have a dictionary and want to write it back out in FASTA format. I am new to FASTA files, so I don't know how to proceed.

The dictionary has this format:

{'NavAb:/1126': 'TNIVESSFFTKFIIYLIVLNGITMGLETSKTFMQSFGVYTTLFNQIVITIFTIEIILRIYVHRISFFKDPWSLFDFFVVAISLVPTSSGFEILRVLRVLRLFRLVTAVPQMRKI', 'Shaker:/1656': 'SSQAARVVAIISVFVILLSIVIFCLETLEDEVPDITDPFFLIETLCIIWFTFELTVRFLACPLNFCRDVMNVIDIIAIIPYFITTLNLLRVIRLVRVFRIFKLSRHSKGLQIL', .....

The function:

def parse_file(input_file):
    # Maps each header (without the leading ">") to its full sequence.
    parsed_seqs = {}
    curr_seq_id = None
    curr_seq = []
    for line in input_file:
        line = line.strip()
        line = line.replace('-', '')
        if line.startswith(">"):
            # A new record starts: store the previous one first.
            if curr_seq_id is not None:
                parsed_seqs[curr_seq_id] = ''.join(curr_seq)
            curr_seq_id = line[1:]
            curr_seq = []
            continue
        curr_seq.append(line)
    # Store the last record, which has no following header line.
    if curr_seq_id is not None:
        parsed_seqs[curr_seq_id] = ''.join(curr_seq)
    return parsed_seqs

with open("file") as newfile:
    parsed_seqs = parse_file(newfile)
print(parsed_seqs)

Solution

  • If you can use an existing library for this task, you may use Biotite:

    import biotite.sequence.io.fasta as fasta
    
    seq_dict = {
        'NavAb:/1126': 'TNIVESSFFTKFIIYLIVLNGITMGLETSKTFMQSFGVYTTLFNQIVITIFTIEIILRIYVHRISFFKDPWSLFDFFVVAISLVPTSSGFEILRVLRVLRLFRLVTAVPQMRKI',
        'Shaker:/1656': 'SSQAARVVAIISVFVILLSIVIFCLETLEDEVPDITDPFFLIETLCIIWFTFELTVRFLACPLNFCRDVMNVIDIIAIIPYFITTLNLLRVIRLVRVFRIFKLSRHSKGLQIL'
    }
    
    fasta_file = fasta.FastaFile()
    for header, seq_str in seq_dict.items():
        fasta_file[header] = seq_str
    fasta_file.write("path/to/file.fasta")
    

    path/to/file.fasta:

    >NavAb:/1126
    TNIVESSFFTKFIIYLIVLNGITMGLETSKTFMQSFGVYTTLFNQIVITIFTIEIILRIYVHRISFFKDPWSLFDFFVVA
    ISLVPTSSGFEILRVLRVLRLFRLVTAVPQMRKI
    >Shaker:/1656
    SSQAARVVAIISVFVILLSIVIFCLETLEDEVPDITDPFFLIETLCIIWFTFELTVRFLACPLNFCRDVMNVIDIIAIIP
    YFITTLNLLRVIRLVRVFRIFKLSRHSKGLQIL
    

    Disclosure: I am one of the developers of this package. Many other packages, such as Biopython, can also do this.
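
  • If adding a dependency is not an option, the same result can be sketched in plain Python. This is a minimal sketch, not taken from any library: the function names `format_fasta` and `write_fasta` are hypothetical, and the default line width of 80 characters is an assumption based on common FASTA practice. It expects the `{header: sequence}` dictionary produced by the question's `parse_file`.

```python
def format_fasta(seq_dict, width=80):
    """Return the FASTA text for a {header: sequence} dictionary."""
    lines = []
    for header, seq in seq_dict.items():
        lines.append(f">{header}")
        # Split each sequence into chunks of at most `width` characters.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    # Terminate every line, including the last one, with a newline.
    return "".join(line + "\n" for line in lines)


def write_fasta(seq_dict, path, width=80):
    """Write a {header: sequence} dictionary to `path` in FASTA format."""
    with open(path, "w") as handle:
        handle.write(format_fasta(seq_dict, width))
```

    With the dictionary from the question, `write_fasta(parsed_seqs, "cleaned.fasta")` would write one record per key (the file name here is only an example). Since Python 3.7, dictionaries keep insertion order, so records come out in the order they were parsed.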