python, dictionary, fasta

Multiple dict values generated from multi-line FASTA


I'm trying to build a dictionary of sequence identifiers and their sequences (as keys and values, respectively) from a FASTA file, and I've run into a problem that my novice programming brain couldn't solve.

In short, each sequence in my multi-line FASTA file (an example is shown below) is being stored as multiple values per key. Each new line in the FASTA file produces a new value, instead of the entire sequence being stored as a single value per sequence identifier.

My code is below, and the example FASTA file from which I'm pulling is below that. Any help on how to make the entire sequence be stored as a single value and not multiple values would be helpful! Looks like I have lots of reading to keep doing...

Thanks in advance for any help!

import sys
sequence = ''
fasta = {}
def seqs_from_file(filename):
    with open(filename) as f:
        for line in f:
            line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(">"):
            seq_name = line[1:]
            if seq_name not in fasta:
                fasta[seq_name] = []
            continue
        sequence = line
        fasta[seq_name].append(sequence)
print(fasta) # printing here is just so I can see if my dict. was correctly made

Example from the FASTA file:

>646311950
ATGAATAATCGAGTCCACCAGGGCCACTTAGCCCGTAAACGCTTCGGGCA
AAACTTTCTCAACGATCAGTTCGTGATCGACAGTATTGTGTCTGCCATTA
ACCCGCAAAAGGGCCAGGCGATGGTCGAAATCGGCCCCGGTCTGGCGGCA
TTGACCGAACCGGTCGGCGAACGTCTGGACCAGCTGACGGTCATCGAACT
TGACCGCGATCTGGCGGCACGTCTGCAAACGCATCCATTCTTAGGCCCGA
AACTGACGATTTATCAGCAGGATGCGATGACCTTTAACTTTGGTGAACTG
GCCGAGAAAATGGGTCAGCCGCTGCGTGTTTTCGGCAACCTGCCTTATAA
CATCTCCACGCCGTTGATGTTCCATCTGTTTAGCTATACTGATGCCATTG
CCGACATGCACTTTATGTTGCAAAAAGAGGTGGTGAATCGTCTGGTTGCA
GGACCGAACAGCAAAGCGTATGGTCGATTAAGCGTCATGGCGCAATACTA
TTGCAATGTGATCCCGGTACTGGAAGTACCGCCGTCAGCCTTTACACCAC
CACCCAAAGTGGATTCCGCCGTCGTGCGCCTGGTTCCTCATGCAACGATG
CCTCACCCGGTTAAAGATGTTCGTGTGTTGAGCCGCATCACCACCGAAGC
CTTTAACCAGCGTCGTAAAACCATTCGTAACAGCCTCGGCAACCTGTTTA
GCGTCGAGGTGTTAACGGGAATGGGGATCGACCCGGCGATGCGAGCGGAA
AATATCTCTGTCGCGCAATATTGCCAGATGGCGAACTATCTGGCGGAGAA
CGCGCCTTTGCAGGAGAGTTAA

Solution

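  • Your line-processing logic should be indented so that it runs inside the for loop. Also, appending each line to fasta[seq_name] builds a list with one element per line; to get one value per sequence, concatenate each line onto a string instead. The version below keeps a list per identifier, so a repeated identifier gets its own entry, and concatenates each sequence line onto the last entry in that list: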

    fasta = {}

    def seqs_from_file(filename):
        with open(filename) as f:
            for line in f:
                line = line.rstrip("\n")
                # Skip blank lines.
                if not line:
                    continue
                if line.startswith(">"):
                    # Header line: start a new, empty sequence for this identifier.
                    seq_name = line[1:]
                    if seq_name in fasta:
                        fasta[seq_name].append('')
                    else:
                        fasta[seq_name] = ['']
                    continue
                # Sequence line: extend the most recent sequence for this identifier.
                fasta[seq_name][-1] += line
        print(fasta)  # each identifier maps to a list with one string per record
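
If each identifier should map to a plain string rather than a list, one possible variation is sketched below. The function name read_fasta, the demo file contents, and the choice to treat a repeated identifier as a continuation of the same sequence are assumptions made for illustration; they are not part of the original question or answer. The sketch collects the lines for each identifier and joins them once at the end.

```python
import os
import tempfile


def read_fasta(filename):
    """Return {identifier: sequence} with one joined string per identifier."""
    chunks = {}      # identifier -> list of sequence lines seen so far
    seq_name = None  # no header has been read yet
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                seq_name = line[1:]
                # Assumption: a repeated identifier continues the same sequence.
                chunks.setdefault(seq_name, [])
            elif seq_name is not None:
                chunks[seq_name].append(line)
    # Join the collected lines into a single string per identifier.
    return {name: "".join(parts) for name, parts in chunks.items()}


# Small demonstration with a made-up two-record file.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seq1\nATGA\nCCGT\n\n>seq2\nTTAA\n")
sequences = read_fasta(tmp.name)
os.remove(tmp.name)
print(sequences)  # {'seq1': 'ATGACCGT', 'seq2': 'TTAA'}
```

Returning the dictionary instead of filling a module-level one lets the function be called on several files without the results mixing together.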