python bioinformatics biopython

Validate protein sequence


In some cases, I have sequences containing characters that do not correspond to proteins.

>ISAnsp8_orf1
MRKSRFTEEQIAHALRQVDAGVPAAELCRKLGISEQTFYAWKKKYAGMGIAEMRRVKQLEDENRRLKTLVADLTLDKHMLQEVLRKKF
>IS3_orf1
UGAAGAGCUGGCUAUCCUCCAAAAGGCCGCGACAUACUUCGCGAAGCGCC
>IS3_orf2
..............................(((((((((((......[[[
>IS3_orf3
UGAAAUGAAGUAUGUCUUUAUUGAAAAACAUCAGGCUGAGUUCAGCAUCA
>IS3_orf4
[[[..)))))))))))..............]]]]]]
>IS3_orf5
AAGCAAUGUGCCGCGUGCUCCGGGUGGCCCGCA
>IS3_orf7
MTKTVSTSKKPRKQHSPEFRSEALKLAERIGVTAAARELSLYESQLYNWRSKQQNQQTSSERELEMSTEIARLKRQLAERDEELAILQKAATYFAKRLK

Because I want to validate the sequences before saving them to another file, I wrote this to test a validation method. The result is strange: I used two different sequences, one of them containing non-protein characters such as '(', but it still gives me the answer True.

Testing all three possibilities for 'sequence', the answer is the same (False).

import sys
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC, ProteinAlphabet

sequence = sys.argv[1]
#sequence = '((((((((((('
#sequence = 'TGEKPYVCQECGKAFNCSSYLSKHQR'

my_prot = Seq(sequence, alphabet=IUPAC.IUPACProtein)

print isinstance(my_prot.alphabet, ProteinAlphabet)     

if isinstance(my_prot.alphabet, ProteinAlphabet) == True:
  print 'ok' , isinstance(my_prot.alphabet, ProteinAlphabet)
else:
  print 'no'

Solution

  • Biopython currently does not validate the alphabet when you instantiate a Seq or similar object (the main reason for this is the large performance cost). There is a lot of discussion surrounding this and the situation may change in the future; in fact, the first Biopython Enhancement Proposal (BEP) is about the use of alphabets in Biopython.

    Anyway, to solve your issue for now, there is a _verify_alphabet function buried in Biopython. Although it's 'private', I see no reason not to use it:

    from Bio.Seq import Seq
    from Bio.Alphabet import IUPAC, _verify_alphabet
    
    sequences = ['TGEKPYVCQECGKAFNCSSYLSKHQR', '(((((((((((']
    
    for sequence in sequences:
        my_prot = Seq(sequence, IUPAC.protein)
        print(my_prot, _verify_alphabet(my_prot))
    

    Output (in Python 3.6 with Bio version 1.73dev):

    TGEKPYVCQECGKAFNCSSYLSKHQR True
    ((((((((((( False
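
Since _verify_alphabet is private (and, as far as I know, the whole Bio.Alphabet module was removed in Biopython 1.78), a check that does not depend on Biopython may be more robust. The following is only a minimal sketch: the names PROTEIN_LETTERS, is_protein and parse_fasta are hypothetical helpers, not Biopython API, and it assumes that "valid protein" means "only the 20 standard amino-acid letters". It filters a shortened copy of the FASTA records from the question before they would be saved:

```python
# Assumption: a valid protein uses only the 20 standard amino-acid letters.
# Extend this set (e.g. with "X", "B", "Z", "U", "O", "*") if your data needs it.
PROTEIN_LETTERS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_protein(sequence):
    """Return True if the sequence is non-empty and uses only PROTEIN_LETTERS."""
    sequence = sequence.strip().upper()
    return bool(sequence) and set(sequence) <= PROTEIN_LETTERS


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


fasta_text = """>ISAnsp8_orf1
MRKSRFTEEQIAHALRQVDAGVPAAELCRKLGISEQTFYAWKKKYAGMGIAEMRRVKQLEDENRRLKTLVADLTLDKHMLQEVLRKKF
>IS3_orf1
UGAAGAGCUGGCUAUCCUCCAAAAGGCCGCGACAUACUUCGCGAAGCGCC
>IS3_orf2
..............................(((((((((((......[[[
>IS3_orf7
MTKTVSTSKKPRKQHSPEFRSEALKLAERIGVTAAARELSLYESQLYNWRSKQQNQQTSSERELEMSTEIARLKRQLAERDEELAILQKAATYFAKRLK
"""

# Keep only the records whose sequence passes the check.
valid_records = [(h, s) for h, s in parse_fasta(fasta_text) if is_protein(s)]

for header, seq in valid_records:
    print(">" + header)
    print(seq)
```

Here only ISAnsp8_orf1 and IS3_orf7 are kept. One caveat with any letter-based check: a nucleotide sequence written with only A, C, G and T would pass, because those are also amino-acid letters. The RNA records above are rejected only because they contain U.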