Search code examples
pythonbiopythongenbank

Biopython parsing a GBK file without genome sequence


I wrote a script that uses a GenBank file and Biopython to fetch the sequences of given genes from the sequence part of the GBK file, which my colleagues use for their work.

We had some problems now with a new data set, and it turned out that the GBK file that was downloaded did not contain a sequence (which can easily happen when you download from the GenBank website at NCBI). Instead of throwing an error, Biopython returns a long sequence of Ns when using record.seq[start:end]. What is the easiest way to catch that problem right from the start to stop the script with an error message?


Solution

  • Right, I found a way. If I count the Ns in the sequence and check if there are as many as the sequence is long, I know that the sequence is missing:

    import sys
    from Bio import SeqIO    
    
    for seq_record in SeqIO.parse("sequence.gb", "genbank"):
      sequence = seq_record.seq
      if len(sequence) == sequence.count("N"):
        sys.exit("There seems to be no sequence in your GenBank file!")
    

    I would have preferred a solution that checks the sequence type instead, since the empty sequence is Bio.Seq.UnknownSeq, instead of Bio.Seq.Seq for a real sequence, and would be thankful if anyone can suggest something in that direction.

    Update

    @xbello made me try again to check the sequence type, now this also works:

    import sys, Bio
    from Bio import SeqIO    
    
    for seq_record in SeqIO.parse("sequence.gb", "genbank"):
      sequence = seq_record.seq
      if isinstance(sequence, Bio.Seq.UnknownSeq):
        sys.exit("There seems to be no sequence in your GenBank file!")