Search code examples
pythonbioinformaticsbiopythonfasta

Estimate Alphabet in Biopython from fasta file


I am looking for a way to read a .fasta file in Biopython and have the package estimate if we are dealing with DNA, RNA or proteins. So far, I read data like this:

with open('file.fasta', 'r') as f:
    for seq in sio.parse(f, 'fasta'):
        # do stuff, depending on alphabet

My problem is now that I do not know what kind of sequences I will find in the .fasta file. It can be either proteins, DNA or RNA but I have to know the number of letters in the alphabet.

Is there a way to "estimate" the alphabet from the sequences with Biopython? I am aware that one could have a protein containing just the letters ACGT which is why I would like to estimate the alphabet.


Solution

  • This is very difficult to do for small sequences. For example, the sequence ACGCGACAGA can be both a DNA, an RNA or a protein sequence, since the letters A, C and G are common to all three alphabets. Without other knowledge it is impossible to estimate which is the best match.

    The following code will print out all the alphabets that the first record in a given FASTA file belongs to:

    from Bio import SeqIO
    from Bio.Alphabet.IUPAC import *
    
    alphabets = [extended_protein,  ambiguous_dna, unambiguous_dna, extended_dna, ambiguous_rna, unambiguous_rna]
    
    def validate(seq, alphabet):
        "Checks that a sequence only contains values from an alphabet"
        # inspired by https://www.biostars.org/p/102/
        leftover = set(str(seq).upper()) - set(alphabet.letters)
        return not leftover
    
    with open("example.fasta") as handle:
        first_record = list(SeqIO.parse(handle, "fasta"))[0]
        valid_alphabets = [str(alphabet) for alphabet in alphabets if validate(first_record.seq, alphabet)]
        print("Valid alpahabet(s) for fasta file: {}".format(', '.join(valid_alphabets)))
    

    So, for the sequence ACGCGACAGA this will print:

    Valid alpahabet(s) for fasta file: ExtendedIUPACProtein(), IUPACAmbiguousDNA(), IUPACUnambiguousDNA(), ExtendedIUPACDNA(), IUPACAmbiguousRNA(), IUPACUnambiguousRNA()
    

    But for the sequence MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVFX it will print:

    Valid alpahabet(s) for fasta file: ExtendedIUPACProtein()